Data

Data modules specify data sources, as well as data loading and preprocessing, for use with Tasks. They provide a simple interface for swapping data sources and reusing datasets in new workflows without any code changes, enabling rapid and reproducible experimentation. They are specified with the --data argument in the CLI or in the data section of a configuration file.

Data modules can automatically load common data sources (json, tsv, txt, HuggingFace) and uncommon ones (h5ad, TileDB). They transform, split, and sample these sources for training with mgen fit, evaluation with mgen test/validate, and inference with mgen predict.

This reference gives an overview of the available no-code data modules. To develop new datasets, see Experiment Design.

data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
    - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    train_split_name: train
    random_seed: 42
    batch_size: 32
    cv_num_folds: 5
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_fold_id_col: fold_id
model:
  ...
trainer:
  ...

Note: Data modules are designed for use with a specific task, indicated in the class name.

DNA

modelgenerator.data.NTClassification

Bases: SequenceClassificationDataModule

Nucleotide Transformer benchmarks from InstaDeep.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'InstaDeepAI/nucleotide_transformer_downstream_tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'enhancers'
x_col Union

The name of the column(s) containing the sequences.

'sequence'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'sequence': 'sequences'}
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
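
As an illustration of the split parameters above, the sketch below keeps the default enhancers configuration and holds out a validation set from the training split, since valid_split_name defaults to None. It assumes these parameters are passed under init_args in the same way as in the example at the top of this page. The values are illustrative, not recommended settings.

```yaml
data:
  class_path: modelgenerator.data.NTClassification
  init_args:
    config_name: enhancers    # default HF dataset configuration
    train_split_name: train
    test_split_name: test
    valid_split_name: null    # no named validation split ...
    valid_split_size: 0.1     # ... so 10% of the training split is held out
    random_seed: 42           # seed used when splitting
    batch_size: 64            # illustrative value
```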

modelgenerator.data.GUEClassification

Bases: SequenceClassificationDataModule

Genome Understanding Evaluation benchmarks for DNABERT-2 from the Liu Lab at Northwestern.

Note
  • Manuscript: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
  • Data Card: leannmlindsey/GUE
  • Configs:
    • emp_H3
    • emp_H3K14ac
    • emp_H3K36me3
    • emp_H3K4me1
    • emp_H3K4me2
    • emp_H3K4me3
    • emp_H3K79me3
    • emp_H3K9ac
    • emp_H4
    • emp_H4ac
    • human_tf_0
    • human_tf_1
    • human_tf_2
    • human_tf_3
    • human_tf_4
    • mouse_0
    • mouse_1
    • mouse_2
    • mouse_3
    • mouse_4
    • prom_300_all
    • prom_300_notata
    • prom_300_tata
    • prom_core_all
    • prom_core_notata
    • prom_core_tata
    • splice_reconstructed
    • virus_covid
    • virus_species_40
    • fungi_species_20
    • EPI_K562
    • EPI_HeLa-S3
    • EPI_NHEK
    • EPI_IMR90
    • EPI_HUVEC

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'leannmlindsey/GUE'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'emp_H3'
x_col Union

The name of the column(s) containing the sequences.

'sequence'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'sequence': 'sequences'}
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
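
Switching between the benchmark tasks in the Configs list should only require changing config_name. A minimal sketch, assuming the same init_args layout as the example at the top of this page; the data loading values are illustrative.

```yaml
data:
  class_path: modelgenerator.data.GUEClassification
  init_args:
    config_name: prom_core_tata   # any entry from the Configs list
    batch_size: 32                # illustrative value
    num_workers: 4                # worker processes for data loading
    persistent_workers: true      # keep the workers alive between epochs
```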

modelgenerator.data.ClinvarRetrieve

Bases: ZeroshotClassificationRetrieveDataModule

ClinVar dataset for genomic variant effect prediction.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

None
test_split_files List

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

['ClinVar_Processed.tsv']
reference_file str

The path to the reference file used for retrieving sequences.

'hg38.ml.fa'
method str

The method used to compute metrics.

'Distance'
window int

The number of tokens taken on either side of the mutation site. The processed sequence length is 2 * window + 1.

512
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
index_cols List

The list of column names containing the index used for sequence retrieval.

['chrom', 'start', 'end', 'ref', 'mutate']
y_col str

The name of the column containing the labels. Defaults to "label".

'label'
train_split_name Optional

The name of the training split.

'train'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
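
Because path has no default for this module, a configuration has to point at the data explicitly. The sketch below uses a hypothetical local folder, /path/to/clinvar, and assumes that the default test file and reference file are found inside it; how the module actually resolves these file names is not stated here. With window set to 256, the processed sequence length would be 2 * 256 + 1 = 513.

```yaml
data:
  class_path: modelgenerator.data.ClinvarRetrieve
  init_args:
    path: /path/to/clinvar        # hypothetical local data folder
    test_split_files:
    - ClinVar_Processed.tsv       # default test file
    reference_file: hg38.ml.fa    # default reference file
    method: Distance              # default metric mode
    window: 256                   # sequence length: 2 * 256 + 1 = 513
```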

modelgenerator.data.PromoterExpressionRegression

Bases: SequenceRegressionDataModule

Gene expression prediction from promoter sequences from the Regev Lab at the Broad Institute.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/100M-random-promoters'
x_col Union

The name of the column(s) containing the sequences.

'sequence'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'sequence': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
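
The same module can be pointed at local files instead of the default HuggingFace dataset. The sketch below is an assumption-heavy illustration: the folder and file names are hypothetical, and the files are assumed to be tsv tables with sequence and label columns so that the default x_col, y_col, and rename_cols still apply.

```yaml
data:
  class_path: modelgenerator.data.PromoterExpressionRegression
  init_args:
    path: /path/to/my_promoters   # hypothetical local data folder
    train_split_files:
    - train.tsv                   # hypothetical file, becomes the "train" split
    test_split_files:
    - test.tsv                    # hypothetical file, becomes the "test" split
    train_split_name: train
    test_split_name: test
    valid_split_size: 0.1         # validation set held out from "train"
    normalize: true               # normalize the labels
```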

modelgenerator.data.PromoterExpressionGeneration

Bases: ConditionalDiffusionDataModule

Promoter generation from gene expression data from the Regev Lab at the Broad Institute.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/100M-random-promoters'
x_col Union

The name of the column(s) containing the sequences.

'sequence'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'sequence': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
timesteps_per_sample int

The number of timesteps per sample.

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost).

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
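
Compared with PromoterExpressionRegression, this module adds timesteps_per_sample and randomize_targets. A minimal sketch that sets both explicitly, assuming the same init_args layout as the example at the top of this page; the batch size is illustrative.

```yaml
data:
  class_path: modelgenerator.data.PromoterExpressionGeneration
  init_args:
    normalize: true             # normalize the labels
    timesteps_per_sample: 10    # default number of timesteps per sample
    randomize_targets: false    # experimental option, off by default
    batch_size: 64              # illustrative value
```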

modelgenerator.data.DependencyMappingDataModule

Bases: SequencesDataModule

Data module for doing dependency mapping via in silico mutagenesis on a dataset of sequences. Only uses the test set.

Note

Each sample includes a single sequence under the key 'sequences' and, optionally, an 'ids' entry to track outputs.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

required
vocab_file str

The path to the file with the vocabulary to mutate.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
x_col str

The name of the column containing the sequences. Defaults to "sequence".

'sequence'
id_col str

The name of the column containing the ids. Defaults to "id".

'id'
train_split_name Optional

The name of the training split.

'train'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
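
Both path and vocab_file are required, so a configuration for this module always names them. In the sketch below every path and file name is hypothetical, and the input file is assumed to have sequence and id columns matching the defaults.

```yaml
data:
  class_path: modelgenerator.data.DependencyMappingDataModule
  init_args:
    path: /path/to/sequences         # hypothetical local data folder
    vocab_file: /path/to/vocab.txt   # hypothetical vocabulary file
    test_split_files:
    - my_sequences.tsv               # hypothetical file, becomes the "test" split
    test_split_name: test            # only the test set is used
    x_col: sequence
    id_col: id
```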

RNA

modelgenerator.data.TranslationEfficiency

Bases: SequenceRegressionDataModule

Translation efficiency prediction benchmarks from the Wang Lab at Princeton.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'translation_efficiency_Muscle'
x_col

The name of the column(s) containing the sequences.

'sequences'
y_col

The name of the column(s) containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

10
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
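
The defaults describe a 10-fold cross-validation over a pre-split dataset, with fold membership read from the fold_id column. Each run evaluates a single fold, so covering every fold presumably means repeating the run with a different cv_test_fold_id each time. That reading, and the assumption that fold ids start at 0, are inferred from the parameter descriptions rather than stated by them.

```yaml
data:
  class_path: modelgenerator.data.TranslationEfficiency
  init_args:
    config_name: translation_efficiency_Muscle
    cv_num_folds: 10
    cv_fold_id_col: fold_id     # folds come from the pre-split dataset
    cv_test_fold_id: 3          # evaluate on fold 3 in this run
    cv_enable_val_fold: true    # reserve another fold for validation
```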

modelgenerator.data.ExpressionLevel

Bases: SequenceRegressionDataModule

Expression level prediction benchmarks from the Wang Lab at Princeton.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'expression_Muscle'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

10
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
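
Going by the descriptions above, cv_replace_val_fold_as_test_fold only takes effect once the validation fold is disabled. The sketch below sets the two flags together; the exact behavior during training is not described here, so treat this as an illustration of how the options combine rather than a recommended setup.

```yaml
data:
  class_path: modelgenerator.data.ExpressionLevel
  init_args:
    config_name: expression_Muscle
    cv_test_fold_id: 0
    cv_enable_val_fold: false                # no separate validation fold
    cv_replace_val_fold_as_test_fold: true   # only used when the flag above is false
```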

modelgenerator.data.TranscriptAbundance

Bases: SequenceRegressionDataModule

Transcript abundance prediction benchmarks from the Wang Lab at Princeton.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'transcript_abundance_athaliana'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
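
To ignore the pre-assigned folds and let the module split the data itself, cv_fold_id_col can be set to None, written as null in YAML. A minimal sketch; whether random_seed also controls the automatic fold assignment is an assumption based on its description.

```yaml
data:
  class_path: modelgenerator.data.TranscriptAbundance
  init_args:
    config_name: transcript_abundance_athaliana
    cv_num_folds: 5
    cv_fold_id_col: null    # None: enable automatic splitting
    cv_test_fold_id: 0
    random_seed: 42         # assumed to make the split reproducible
```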

modelgenerator.data.ProteinAbundance

Bases: SequenceRegressionDataModule

Protein abundance prediction benchmarks from the Wang Lab at Princeton.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'protein_abundance_athaliana'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.NcrnaFamilyClassification

Bases: SequenceClassificationDataModule

Non-coding RNA family classification benchmarks from DPTechnology.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'ncrna_family_bnoise0'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
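
As a minimal sketch of overriding the loader-related defaults in the table above, the fragment below raises num_workers and enables persistent_workers. The numeric values are illustrative assumptions rather than tuned recommendations, and the sketch assumes these arguments are forwarded to a PyTorch-style data loader, where persistent workers only take effect with at least one worker.

```yaml
data:
  class_path: modelgenerator.data.NcrnaFamilyClassification
  init_args:
    config_name: ncrna_family_bnoise0  # default configuration from the table
    batch_size: 64                     # illustrative value; the default is 128
    num_workers: 4                     # illustrative value; the default is 0
    persistent_workers: true           # assumed to require num_workers > 0
    pin_memory: true                   # same as the default
```

All other parameters keep the defaults listed in the table.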

modelgenerator.data.SpliceSitePrediction

Bases: SequenceClassificationDataModule

Splice site prediction benchmarks from the Thompson Lab at University of Strasbourg.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'splice_site_acceptor'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test_danio'
batch_size int

The batch size.

16
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
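
For orientation, the fragment below spells out the configuration and split defaults from the table above as they might appear in the data section of a configuration file. It is a sketch that only restates documented defaults; any other config_name or test split name would be an assumption and should be checked against the dataset itself.

```yaml
data:
  class_path: modelgenerator.data.SpliceSitePrediction
  init_args:
    config_name: splice_site_acceptor  # default HF dataset configuration
    train_split_name: train
    valid_split_name: validation
    test_split_name: test_danio        # default test split; also used for mgen predict
    batch_size: 16                     # default for this module
```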

modelgenerator.data.ModificationSitePrediction

Bases: SequenceClassificationDataModule

Modification site prediction benchmarks from the Meng Lab at the University of Liverpool.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'modification_site'
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col List

The name of the column(s) containing the labels.

['labels_0', 'labels_1', 'labels_2', 'labels_3', 'labels_4', 'labels_5', 'labels_6', 'labels_7', 'labels_8', 'labels_9', 'labels_10', 'labels_11']
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
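
Because y_col takes a list here, a subset of the twelve label columns can in principle be selected. The sketch below keeps three of the default columns; the choice of columns is purely illustrative, and it is an assumption that the paired task must then be configured for the same number of outputs.

```yaml
data:
  class_path: modelgenerator.data.ModificationSitePrediction
  init_args:
    x_col: sequences
    y_col:           # illustrative subset of the default label columns
      - labels_0
      - labels_1
      - labels_2
```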

modelgenerator.data.RNAMeanRibosomeLoadDataModule

Bases: SequenceRegressionDataModule

Data module for the mean ribosome load dataset.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'mean_ribosome_load'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
x_col str

The name of the column(s) containing the sequences.

'utr'
y_col str

The name of the column(s) containing the labels.

'rl'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'utr': 'sequences'}
normalize bool

Whether to normalize the labels.

False
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
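
The column settings above fit together as follows: x_col and y_col name the source columns, and rename_cols maps the sequence column to the name "sequences". The sketch below restates those defaults and switches on normalize; how the labels are normalized is not described in this reference, so treat that line as an option to verify rather than a documented recipe.

```yaml
data:
  class_path: modelgenerator.data.RNAMeanRibosomeLoadDataModule
  init_args:
    x_col: utr            # source column holding the sequences
    y_col: rl             # source column holding the regression labels
    rename_cols:
      utr: sequences      # default mapping from the table
    normalize: true       # default is false; enables label normalization
```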

Protein

modelgenerator.data.ContactPredictionBinary

Bases: TokenClassificationDataModule

Protein contact prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/contact_prediction_binary'
pairwise bool

Whether the labels are pairwise.

True
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
extra_cols Optional

Additional columns to include in the dataset.

None
max_length Optional

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
collate_fn Optional

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
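
Pairwise contact labels grow quadratically with sequence length, which is presumably why batch_size defaults to 1 here. The sketch below additionally sets max_length; the value 512 is an illustrative assumption, and the exact handling of sequences longer than this limit is not described in this reference.

```yaml
data:
  class_path: modelgenerator.data.ContactPredictionBinary
  init_args:
    pairwise: true      # default: labels are defined per residue pair
    batch_size: 1       # default for this module
    max_length: 512     # illustrative cap; the default is null (no limit set)
```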

modelgenerator.data.SspQ3

Bases: TokenClassificationDataModule

Protein secondary structure prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/ssp_q3'
pairwise bool

Whether the labels are pairwise.

False
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
extra_cols Optional

Additional columns to include in the dataset.

None
max_length Optional

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
collate_fn Optional

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
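
The split defaults in this table combine as follows: valid_split_name is None, so per the valid_split_size description a validation split is carved out of the training split, while the named test split is used as-is. The sketch below only restates those defaults explicitly.

```yaml
data:
  class_path: modelgenerator.data.SspQ3
  init_args:
    train_split_name: train
    test_split_name: test      # named split exists, so test_split_size should not apply
    valid_split_name: null     # no named validation split
    valid_split_size: 0.1      # fraction of the training split held out for validation
    random_seed: 42            # seed used when splitting
```

Changing random_seed would be expected to change which training samples end up in the validation split.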

modelgenerator.data.FoldPrediction

Bases: SequenceClassificationDataModule

Protein fold prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/fold_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
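
To reuse this module with local data, the *_split_files arguments can define the splits that the split_name arguments refer to. The sketch below assumes a hypothetical local folder and hypothetical file names, and it assumes the files contain "seq" and "label" columns so that the default x_col, y_col, and rename_cols still apply.

```yaml
data:
  class_path: modelgenerator.data.FoldPrediction
  init_args:
    path: ./my_fold_data        # hypothetical local data folder
    train_split_files:
      - my_train.tsv            # hypothetical file; becomes the split named "train"
    test_split_files:
      - my_test.tsv             # hypothetical file; becomes the split named "test"
    train_split_name: train     # references the split built from train_split_files
    test_split_name: test       # references the split built from test_split_files
```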

modelgenerator.data.LocalizationPrediction

Bases: SequenceClassificationDataModule

Protein localization prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/localization_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
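
class_filter restricts the dataset to samples of the listed classes. In the sketch below the class values 0 and 1 are hypothetical placeholders; the actual label values in the dataset are not listed in this reference and should be checked before filtering.

```yaml
data:
  class_path: modelgenerator.data.LocalizationPrediction
  init_args:
    class_filter:   # keep only samples whose label is one of these values
      - 0           # hypothetical class value
      - 1           # hypothetical class value
```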

modelgenerator.data.MetalIonBinding

Bases: SequenceClassificationDataModule

Metal ion binding prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/metal_ion_binding'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
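
The cross-validation parameters above can be combined into a single configuration. The sketch below enables five folds with automatic fold assignment (cv_fold_id_col set to null). Disabling the named and size-based test and validation splits alongside cross-validation is an assumption about how the two mechanisms interact, not documented behavior.

```yaml
data:
  class_path: modelgenerator.data.MetalIonBinding
  init_args:
    train_split_name: train
    test_split_name: null      # assumption: folds replace the named test split
    valid_split_name: null
    test_split_size: 0         # assumption: no size-based test split
    valid_split_size: 0        # assumption: no size-based validation split
    cv_num_folds: 5            # values <= 1 would disable cross-validation
    cv_test_fold_id: 0         # fold held out for evaluation
    cv_enable_val_fold: true   # also hold out a validation fold
    cv_val_offset: 1           # validation fold derived from cv_test_fold_id plus this offset
    cv_fold_id_col: null       # no pre-assigned fold column; split automatically
    random_seed: 42
```

Running the same configuration once per cv_test_fold_id value from 0 to 4 would cover every fold.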

modelgenerator.data.SolubilityPrediction

Bases: SequenceClassificationDataModule

Protein solubility prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/solubility_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.AntibioticResistance

Bases: SequenceClassificationDataModule

Antibiotic resistance prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'proteinglm/antibiotic_resistance'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
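
As a rough illustration of the split parameters above, the sketch below assumes the `class_path`/`init_args` layout used in the example at the top of this page. It keeps the dataset's named `train` and `test` splits and, because `valid_split_name` is left as None, relies on `valid_split_size` to create a validation split from the training split. The values are illustrative rather than recommended settings.

```yaml
data:
  class_path: modelgenerator.data.AntibioticResistance
  init_args:
    path: proteinglm/antibiotic_resistance
    train_split_name: train
    test_split_name: test
    valid_split_name: null   # no named validation split, so one is created from train
    valid_split_size: 0.1    # size of that validation split (the documented default)
    random_seed: 42          # seed used when splitting the data
```

Changing `random_seed` should change which samples land in the validation split, so keeping it fixed is what makes repeated runs comparable.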

modelgenerator.data.CloningClf

Bases: SequenceClassificationDataModule

Cloning classification prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/cloning_clf'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
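
The cross-validation parameters above can be combined as in the following sketch, which assumes the same `class_path`/`init_args` layout as the example at the top of this page. It requests five automatically assigned folds, holds out fold 0 for evaluation, and enables a validation fold. How cross-validation interacts with the named splits, and how the validation fold id wraps around at the last fold, are not described in this reference, so the split names are left at their defaults.

```yaml
data:
  class_path: modelgenerator.data.CloningClf
  init_args:
    path: proteinglm/cloning_clf
    cv_num_folds: 5            # values <= 1 disable cross-validation
    cv_test_fold_id: 0         # fold held out for evaluation
    cv_enable_val_fold: true   # also hold out a validation fold
    cv_val_offset: 1           # validation fold id is derived from cv_test_fold_id and this offset
    cv_fold_id_col: null       # no pre-assigned folds, so splitting is automatic
```

To cover every fold, one would presumably run the same configuration once per value of `cv_test_fold_id` from 0 to 4.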

modelgenerator.data.MaterialProduction

Bases: SequenceClassificationDataModule

Material production prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/material_production'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
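
The `*_split_files` parameters above only take effect when the matching split name refers to them. The sketch below shows one way this might look for a local dataset. The folder and file names are hypothetical, and it assumes the files contain the default `seq` and `label` columns and that the file arguments accept a list, as in the example at the top of this page.

```yaml
data:
  class_path: modelgenerator.data.MaterialProduction
  init_args:
    path: ./my_material_data     # hypothetical local data folder
    train_split_files:
    - train.tsv                  # hypothetical file name
    test_split_files:
    - test.tsv                   # hypothetical file name
    train_split_name: train      # references the "train" split built from train_split_files
    test_split_name: test        # references the "test" split built from test_split_files
```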

modelgenerator.data.TcrPmhcAffinity

Bases: SequenceClassificationDataModule

TCR-pMHC affinity prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/tcr_pmhc_affinity'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
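
The loading parameters above (`batch_size`, `shuffle`, `num_workers`, `pin_memory`, `persistent_workers`) can be tuned together. The sketch below is illustrative only and assumes these arguments are forwarded to a PyTorch `DataLoader`, which this reference does not state explicitly.

```yaml
data:
  class_path: modelgenerator.data.TcrPmhcAffinity
  init_args:
    path: proteinglm/tcr_pmhc_affinity
    batch_size: 32             # smaller than the default of 128
    shuffle: true
    num_workers: 4             # load data in 4 worker processes instead of the default 0
    persistent_workers: true   # keep the workers alive between epochs
    pin_memory: true
```

If the assumption about `DataLoader` holds, `persistent_workers: true` requires `num_workers` to be greater than 0, which is why the two are changed together here.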

modelgenerator.data.PeptideHlaMhcAffinity

Bases: SequenceClassificationDataModule

Peptide-HLA-MHC affinity prediction benchmarks from BioMap.

Note

- Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
- Data Card: proteinglm/peptide_HLA_MHC_affinity

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/peptide_HLA_MHC_affinity'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.TemperatureStability

Bases: SequenceClassificationDataModule

Temperature stability prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/temperature_stability'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.FluorescencePrediction

Bases: SequenceRegressionDataModule

Fluorescence prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/fluorescence_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
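
Compared with the other regression modules in this section, this one adds `max_context_length`, `msa_random_seed`, and `is_rag_dataset`. The sketch below simply spells out their documented defaults alongside `normalize`, assuming the `class_path`/`init_args` layout from the example at the top of this page. It is a starting point for editing, not a tuned configuration.

```yaml
data:
  class_path: modelgenerator.data.FluorescencePrediction
  init_args:
    path: proteinglm/fluorescence_prediction
    normalize: true              # normalize the regression labels (the default)
    max_context_length: 12800    # documented default maximum context length
    is_rag_dataset: false        # plain sequence data, not an AIDO.Protein-RAG dataset
    msa_random_seed: null        # documented only as the random seed for MSA generation
```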

modelgenerator.data.FitnessPrediction

Bases: SequenceRegressionDataModule

Fitness prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/fitness_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
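
When a dataset already carries fold assignments, `cv_fold_id_col` points at the column that holds them instead of letting the module split automatically. The sketch below mirrors the pattern of the example at the top of this page. The local folder, file name, and `fold_id` column are hypothetical, and the default `proteinglm/fitness_prediction` dataset is not assumed to contain such a column.

```yaml
data:
  class_path: modelgenerator.data.FitnessPrediction
  init_args:
    path: ./my_fitness_data      # hypothetical local folder
    train_split_files:
    - all_folds.tsv              # hypothetical file that includes a fold column
    train_split_name: train
    cv_num_folds: 5
    cv_test_fold_id: 2           # fold id used for cross-validation evaluation
    cv_fold_id_col: fold_id      # hypothetical column holding the pre-assigned fold ids
```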

modelgenerator.data.StabilityPrediction

Bases: SequenceRegressionDataModule

Stability prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/stability_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
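
`cv_replace_val_fold_as_test_fold` is only consulted when `cv_enable_val_fold` is False, so the two are set together. The sketch below shows that combination, again assuming the `class_path`/`init_args` layout from the example at the top of this page. The exact effect on the validation data is described here only as "replace validation fold with test fold", so check a small run before relying on it.

```yaml
data:
  class_path: modelgenerator.data.StabilityPrediction
  init_args:
    path: proteinglm/stability_prediction
    cv_num_folds: 5
    cv_test_fold_id: 0
    cv_enable_val_fold: false                # no separate validation fold
    cv_replace_val_fold_as_test_fold: true   # only used because cv_enable_val_fold is false
```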

modelgenerator.data.EnzymeCatalyticEfficiencyPrediction

Bases: SequenceRegressionDataModule

Enzyme catalytic efficiency prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/enzyme_catalytic_efficiency'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
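
The column parameters above (`x_col`, `y_col`, `rename_cols`) make it possible to point this module at a file whose columns are not named `seq` and `label`. The sketch below is one hedged way to do that. The folder, file, and column names are all hypothetical, and the `rename_cols` entry follows the default mapping `{'seq': 'sequences'}` on the assumption that the sequence column should end up named `sequences`.

```yaml
data:
  class_path: modelgenerator.data.EnzymeCatalyticEfficiencyPrediction
  init_args:
    path: ./my_enzyme_data           # hypothetical local data folder
    train_split_files:
    - measurements.tsv               # hypothetical file name
    train_split_name: train
    test_split_name: null            # no named test split, so one is created from train
    x_col: protein_sequence          # hypothetical column containing the sequences
    y_col: kcat_km                   # hypothetical column containing the labels
    rename_cols:
      protein_sequence: sequences    # mirrors the default mapping {'seq': 'sequences'}
```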

modelgenerator.data.OptimalTemperaturePrediction

Bases: SequenceRegressionDataModule

Optimal temperature prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/optimal_temperature'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.OptimalPhPrediction

Bases: SequenceRegressionDataModule

Optimal pH prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/optimal_ph'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
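
As an illustration of how the split parameters above interact, the following sketch configures OptimalPhPrediction with the defaults listed in the table. Every value is taken from that table. The model and trainer sections are elided, as in the example at the top of this page.

```yaml
data:
  class_path: modelgenerator.data.OptimalPhPrediction
  init_args:
    path: proteinglm/optimal_ph
    x_col: seq
    y_col: label
    rename_cols:
      seq: sequences        # rename the raw sequence column
    normalize: true         # normalize the regression labels
    train_split_name: train
    test_split_name: test   # also the split used by mgen predict
    valid_split_name: null  # no named validation split is given...
    valid_split_size: 0.1   # ...so 10% of the training split is held out
    random_seed: 42         # seed used when creating that split
    batch_size: 128
```

Because valid_split_name is None by default, the validation data is carved out of the training split rather than read from a named split.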

modelgenerator.data.DMSFitnessPrediction

Bases: SequenceRegressionDataModule

Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'genbio-ai/ProteinGYM-DMS'
train_split_files list

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv']
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

-1
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
normalize bool

Whether to normalize the labels.

True
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
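
The cross-validation parameters above can be combined as in the following sketch, which holds out fold 2 for testing. It assumes the validation fold id is the plain sum of cv_test_fold_id and cv_val_offset, as the cv_val_offset description suggests. How out-of-range results are handled (for example, wrapping around) is not covered in this reference, so the example picks a test fold where no wrapping is needed.

```yaml
data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
    - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    cv_num_folds: 5
    cv_fold_id_col: fold_id   # fold assignments are read from this column
    cv_test_fold_id: 2        # hold out fold 2 for testing
    cv_enable_val_fold: true
    cv_val_offset: -1         # validation fold id: 2 + (-1) = 1 (assumed plain sum)
    valid_split_size: 0       # no additional random validation split
    test_split_size: 0        # no additional random test split
```

To evaluate every fold, the same configuration would be run once per value of cv_test_fold_id.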

Structure

modelgenerator.data.ContactPredictionBinary

Bases: TokenClassificationDataModule

Protein contact prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/contact_prediction_binary'
pairwise bool

Whether the labels are pairwise.

True
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
extra_cols Optional

Additional columns to include in the dataset.

None
max_length Optional

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
collate_fn Optional

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
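
A minimal configuration sketch for ContactPredictionBinary is shown below. It uses the documented defaults, except for max_length, whose value here is purely illustrative and not a documented default.

```yaml
data:
  class_path: modelgenerator.data.ContactPredictionBinary
  init_args:
    path: proteinglm/contact_prediction_binary
    pairwise: true          # labels are defined per residue pair
    x_col: seq
    y_col: label
    rename_cols:
      seq: sequences
    batch_size: 1           # documented default for this module
    max_length: 512         # illustrative value; the default is None
    valid_split_size: 0.1   # validation data is held out from the training split
```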

modelgenerator.data.SspQ3

Bases: TokenClassificationDataModule

Protein secondary structure prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/ssp_q3'
pairwise bool

Whether the labels are pairwise.

False
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
extra_cols Optional

Additional columns to include in the dataset.

None
max_length Optional

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
collate_fn Optional

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
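
The data-loading parameters above can be tuned independently of the dataset settings. The sketch below shows one possible SspQ3 configuration. The num_workers and persistent_workers values are illustrative choices rather than documented defaults.

```yaml
data:
  class_path: modelgenerator.data.SspQ3
  init_args:
    path: proteinglm/ssp_q3
    pairwise: false            # labels are not pairwise for this task
    batch_size: 1              # documented default for this module
    shuffle: true
    num_workers: 4             # illustrative; the default is 0
    persistent_workers: true   # illustrative; the default is False
    pin_memory: true
```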

modelgenerator.data.FoldPrediction

Bases: SequenceClassificationDataModule

Protein fold prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/fold_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
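
The split_files and split_name parameters work together: a file list only takes effect when the split it creates is referenced by name. The sketch below illustrates this for FoldPrediction with local files. The folder and file names are hypothetical, and passing the files as YAML lists is an assumption based on the list-valued default of DMSFitnessPrediction.

```yaml
data:
  class_path: modelgenerator.data.FoldPrediction
  init_args:
    path: /path/to/my_fold_data   # hypothetical local data folder
    train_split_files:
    - train.tsv                   # hypothetical file; becomes the split named "train"
    test_split_files:
    - test.tsv                    # hypothetical file; becomes the split named "test"
    train_split_name: train       # references the split built from train_split_files
    test_split_name: test         # references the split built from test_split_files
    valid_split_name: null        # no named validation split...
    valid_split_size: 0.1         # ...so 10% of the training split is held out
    x_col: seq
    y_col: label
```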

modelgenerator.data.FluorescencePrediction

Bases: SequenceRegressionDataModule

Fluorescence prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

'proteinglm/fluorescence_prediction'
x_col Union

The name of the column(s) containing the sequences.

'seq'
y_col Union

The name of the column(s) containing the labels.

'label'
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'seq': 'sequences'}
normalize bool

Whether to normalize the labels.

True
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
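
As a short sketch of overriding defaults, the configuration below runs FluorescencePrediction with label normalization turned off and sample IDs turned on. The batch_size value is illustrative.

```yaml
data:
  class_path: modelgenerator.data.FluorescencePrediction
  init_args:
    path: proteinglm/fluorescence_prediction
    normalize: false        # keep labels on their original scale (default is True)
    generate_uid: true      # generate a unique ID per sample (default is False)
    test_split_name: test   # also the split used by mgen predict
    batch_size: 64          # illustrative; the default is 128
```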

modelgenerator.data.StructureTokenDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Test-only data module for structure token predictors.

This data module is specifically designed for datasets that use amino acid sequences as input and structure tokens as labels.

Note

This module only supports testing and ignores training and validation splits. It assumes test split files contain sequences and optionally their structural token labels. If structural token labels are not provided, dummy labels are created.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_files Optional

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
batch_size int

The batch size.

1
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
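
Since this module only supports testing, a configuration needs little more than the required path and the test files. The following is a minimal sketch. The path and file name are hypothetical, and passing test_split_files as a YAML list is an assumption.

```yaml
data:
  class_path: modelgenerator.data.StructureTokenDataModule
  init_args:
    path: /path/to/structure_token_data   # hypothetical; path is required
    test_split_files:
    - test.tsv                            # hypothetical file; becomes the split named "test"
    test_split_name: test                 # references the split built from test_split_files
    batch_size: 1                         # documented default
```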

Cell

modelgenerator.data.CellClassificationDataModule

Bases: DataInterface

Data module for cell classification.

Note

Each sample includes a feature vector (one of the rows in the expression matrix X) and a single class label (one of the columns in the obs metadata table).

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier.

required
backbone_class_path Optional

Class path of the backbone model.

None
filter_columns Optional

The columns of obs to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
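
The sketch below shows one way the split parameters might be used with CellClassificationDataModule when the data provides only a training split. The path and column name are hypothetical, and passing filter_columns as a YAML list is an assumption.

```yaml
data:
  class_path: modelgenerator.data.CellClassificationDataModule
  init_args:
    path: /path/to/cell_data   # hypothetical; path is required
    filter_columns:
    - cell_type                # hypothetical label column to keep
    train_split_name: train
    test_split_name: null      # no named test split...
    test_split_size: 0.2       # ...so 20% of the training split is held out for testing
    valid_split_name: null     # no named validation split...
    valid_split_size: 0.1      # ...so 10% of the training split is held out for validation
    random_seed: 42            # seed used when creating the splits
```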

modelgenerator.data.CellClassificationLargeDataModule

Bases: DataInterface

Data module for cell classification. This class handles large datasets and is built on TileDB.

Note

Each sample includes a feature vector (one of the rows in the expression matrix X) and a single class label (one of the columns in the obs metadata table).

Parameters:

Name Type Description Default
path str

Path to the TileDB dataset folder.

required
train_split_subfolder str

Subfolder name for the training split.

required
valid_split_subfolder str

Subfolder name for the validation split.

required
test_split_subfolder str

Subfolder name for the test split.

required
backbone_class_path Optional

Class path of the backbone model.

None
layer_name str

Name of the layer in the TileDB dataset.

'data'
obs_column_name str

Name of the column in obs to use as the label.

'cell_type'
measurement_name str

Name of the measurement in the TileDB dataset.

'RNA'
axis_query_value_filter Optional

Optional filter for the axis query.

None
prefetch_factor int

Number of batches to prefetch.

16
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.ClockDataModule

Bases: DataInterface

Data module for transcriptomic clock tasks.

Note

Each sample includes a feature vector (one row of the feature data) and a single scalar corresponding to donor age (read from one of the label columns).

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
split_column str

The column that defines the split assignments.

required
label_scaling Optional

The type of label scaling to apply.

'z_scaling'
backbone_class_path Optional

Class path of the backbone model.

None
filter_columns Optional

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
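
As a rough illustration, a configuration for this module might look like the sketch below. It uses only parameters documented above; the dataset path and the split column name are hypothetical placeholders, not values shipped with the library.

```yaml
data:
  class_path: modelgenerator.data.ClockDataModule
  init_args:
    path: path/to/clock_dataset   # hypothetical local data folder
    split_column: split           # hypothetical column holding split assignments
    label_scaling: z_scaling      # documented default, written out for clarity
    batch_size: 64
    num_workers: 4
```

Both path and split_column are marked as required, so they are the minimum needed in any such config.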

modelgenerator.data.PertClassificationDataModule

Bases: DataInterface

Data module for perturbation classification.

Note

Each sample includes a feature vector (one row of the feature data) and a single class label (read from one of the label columns).

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
pert_column str

Column containing perturbation labels.

required
cell_line_column str

Column containing cell line labels.

required
cell_line str

Name of cell line to consider.

required
split_seed int

Seed for train/val/test splits.

1234
train_frac float

Fraction of examples to assign to train set.

0.7
val_frac float

Fraction of examples to assign to val set.

0.15
test_frac float

Fraction of examples to assign to test set.

0.15
backbone_class_path Optional

Class path of the backbone model.

None
filter_columns Optional

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
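
A minimal configuration sketch for this module, assuming a dataset with one column of perturbation labels and one of cell line labels. The path, column names, and cell line value are hypothetical; only the parameter names come from the table above.

```yaml
data:
  class_path: modelgenerator.data.PertClassificationDataModule
  init_args:
    path: path/to/perturbation_dataset   # hypothetical local data folder
    pert_column: perturbation            # hypothetical column name
    cell_line_column: cell_line          # hypothetical column name
    cell_line: K562                      # hypothetical cell line value
    split_seed: 1234                     # documented default
    train_frac: 0.8
    val_frac: 0.1
    test_frac: 0.1
```

The fractions here are chosen to sum to 1.0, as the documented defaults (0.7, 0.15, 0.15) do. Whether the module enforces this is not stated in this reference.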

Tissue

modelgenerator.data.CellWithNeighborDataModule

Bases: DataInterface

Data module for cell classification with neighbors for AIDO.Tissue.

Note

Each sample includes a feature vector (one row of the feature data) and a single class label (read from one of the label columns). The feature vector is concatenated with the feature vectors of its neighbors.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
filter_columns Optional

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional

Optional list of columns to rename.

None
use_random_neighbor bool

Whether to use random neighbors.

False
copy_center_as_neighbor bool

Whether to copy center as a neighbor.

False
neighbor_num int

Number of neighbors to consider.

10
generate_uid bool

Whether to generate a unique identifier.

False
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
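
The neighbor-related options can be combined as in the following sketch. The dataset path is a hypothetical placeholder, and the values are illustrative rather than recommended settings.

```yaml
data:
  class_path: modelgenerator.data.CellWithNeighborDataModule
  init_args:
    path: path/to/tissue_dataset     # hypothetical local data folder
    neighbor_num: 10                 # documented default
    use_random_neighbor: false       # documented default
    copy_center_as_neighbor: false   # documented default
    generate_uid: true
    batch_size: 16
```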

Multimodal

modelgenerator.data.IsoformExpression

Bases: SequenceRegressionDataModule

Isoform expression prediction benchmarks.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

'genbio-ai/transcript_isoform_expression_prediction'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col Union

The name of the column(s) containing the sequences.

['dna_seq', 'rna_seq', 'protein_seq']
rename_cols dict

A dictionary mapping the original column names to the new column names.

{'dna_seq': 'dna_sequences', 'rna_seq': 'rna_sequences', 'protein_seq': 'protein_sequences'}
valid_split_name

The name of the validation split.

'valid'
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

'train_*.tsv'
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

'test.tsv'
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

'validation.tsv'
normalize bool

Whether to normalize the labels.

True
y_col Union

The name of the column(s) containing the labels.

'labels'
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
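
Because every parameter of this module has a default, a config can in principle name the class alone. The sketch below writes out a few of the documented defaults explicitly and overrides only batch_size, which is an illustrative choice.

```yaml
data:
  class_path: modelgenerator.data.IsoformExpression
  init_args:
    path: genbio-ai/transcript_isoform_expression_prediction   # documented default
    x_col:
    - dna_seq
    - rna_seq
    - protein_seq
    rename_cols:
      dna_seq: dna_sequences
      rna_seq: rna_sequences
      protein_seq: protein_sequences
    normalize: true
    batch_size: 16   # illustrative override of the default 128
```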

Base Classes

modelgenerator.data.DataInterface

Bases: LightningDataModule, KFoldMixin

Base class for all data modules in this project. Handles the boilerplate of setting up data loaders.

Note

Subclasses must implement the setup method. All datasets should return a dictionary of data items. To use HF loading, add the HFDatasetLoaderMixin. For any task-specific behaviors, implement transformations using torch.utils.data.Dataset objects. See MLM for an example.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.ColumnRetrievalDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Simple data module for retrieving and renaming columns from a dataset.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
in_cols List

The name of the columns to retrieve.

[]
out_cols Optional

The name of the columns to use as the alias for the retrieved columns.

None
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
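
A sketch of how column retrieval and renaming might be configured. The path, file names, and column names are hypothetical, and the assumption that in_cols and out_cols are paired by position is an inference from the parameter descriptions rather than something stated above.

```yaml
data:
  class_path: modelgenerator.data.ColumnRetrievalDataModule
  init_args:
    path: path/to/dataset     # hypothetical local data folder
    train_split_files:
    - train.tsv               # hypothetical file name
    test_split_files:
    - test.tsv                # hypothetical file name
    in_cols:
    - seq                     # hypothetical source column
    - score                   # hypothetical source column
    out_cols:
    - sequences               # alias assumed to pair with "seq"
    - labels                  # alias assumed to pair with "score"
```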

modelgenerator.data.SequencesDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for loading a simple dataset of sequences.

Note

Each sample includes a single sequence under key 'sequences' and optionally an 'id' to track outputs.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
x_col str

The name of the column containing the sequences.

'sequence'
id_col str

The name of the column containing the ids.

'id'
train_split_name Optional

The name of the training split.

'train'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
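
Since the test split is also used for mgen predict, a config for running inference over a file of sequences might look like the sketch below. The path and file name are hypothetical; x_col and id_col are the documented defaults.

```yaml
data:
  class_path: modelgenerator.data.SequencesDataModule
  init_args:
    path: path/to/data        # hypothetical local data folder
    test_split_files:
    - my_sequences.tsv        # hypothetical file, referenced through the "test" split
    x_col: sequence           # documented default
    id_col: id                # documented default
    batch_size: 32
```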

modelgenerator.data.SequenceClassificationDataModule

Bases: ClassificationDataModule, HFDatasetLoaderMixin

Data module for Hugging Face sequence classification datasets.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
class_filter Union

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
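
The split parameters above describe how test and validation sets can be carved out of the training split when no named split exists. A sketch of that setup follows, assuming a single hypothetical file that holds all examples.

```yaml
data:
  class_path: modelgenerator.data.SequenceClassificationDataModule
  init_args:
    path: path/to/data        # hypothetical local data folder
    train_split_files:
    - all_examples.tsv        # hypothetical file name
    x_col: sequences          # documented default
    y_col: labels             # documented default
    test_split_name: null     # no named test split: take test_split_size from train
    test_split_size: 0.2
    valid_split_name: null    # no named validation split: take valid_split_size from train
    valid_split_size: 0.1
    random_seed: 42
```

Fixing random_seed keeps the generated splits reproducible across runs.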

modelgenerator.data.SequenceRegressionDataModule

Bases: RegressionDataModule, HFDatasetLoaderMixin

Data module for sequence regression datasets.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
x_col Union

The name of the column(s) containing the sequences.

'sequences'
y_col Union

The name of the column(s) containing the labels.

'labels'
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
normalize bool

Whether to normalize the labels.

True
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
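
The cross-validation parameters can be combined as in this sketch, which assumes a hypothetical dataset with no pre-assigned folds. Read literally from the descriptions above, fold 0 would be held out for testing and fold 1 (cv_test_fold_id plus cv_val_offset) for validation. How the offset behaves at the last fold is not stated in this reference.

```yaml
data:
  class_path: modelgenerator.data.SequenceRegressionDataModule
  init_args:
    path: path/to/data          # hypothetical local data folder
    train_split_files:
    - measurements.tsv          # hypothetical file name
    normalize: true
    cv_num_folds: 5
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_val_offset: 1
    cv_fold_id_col: null        # no pre-split fold column: split automatically
    random_seed: 42
```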

modelgenerator.data.TokenClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face token classification datasets.

Note

Each sample includes a single sequence under key 'sequences' and a single class sequence under key 'labels'.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
extra_cols Optional

Additional columns to include in the dataset.

None
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
max_length Optional

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
pairwise bool

Whether the labels are pairwise.

False
collate_fn Optional

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when this is <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
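
A configuration sketch for per-token labels with a length cap. The path and the max_length value are hypothetical choices for illustration; the remaining values are the documented defaults written out.

```yaml
data:
  class_path: modelgenerator.data.TokenClassificationDataModule
  init_args:
    path: path/to/data         # hypothetical local data folder
    x_col: sequences           # documented default
    y_col: labels              # documented default
    max_length: 512            # illustrative cap on sequence length
    pairwise: false            # documented default
    truncate_extra_cols: false # documented default
    generate_uid: true
```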

modelgenerator.data.DiffusionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for datasets with discrete diffusion-based noising and loss weights from MDLM.

Parameters:

Name Type Description Default
path str

Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The column with the data to train on.

'sequences'
rename_cols dict[str, str] | None

A dictionary mapping the original column names to the new column names.

None
timesteps_per_sample int

The number of timesteps per sample.

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost).

False
train_split_name Optional

The name of the training split.

'train'
test_split_name Optional

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional

The name of the validation split.

None
train_split_files Union

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Union

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Union

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional

The column name containing the fold id from a pre-split dataset. Setting to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
Notes

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
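
As a hedged illustration of how these arguments combine, the fragment below sketches a `data` section for this module. The `path` value is a hypothetical placeholder, and the remaining values are example choices rather than recommendations. Setting `test_split_name` to null relies on the behavior documented for `test_split_size`: when no test split is named, one is created from the training split.

```yaml
data:
  class_path: modelgenerator.data.DiffusionDataModule
  init_args:
    path: path/to/my_dataset      # hypothetical placeholder, replace with a local folder or HF identifier
    x_col: sequences              # column holding the sequences to noise
    timesteps_per_sample: 10      # number of noise levels drawn per sample
    train_split_name: train
    test_split_name: null         # no named test split, so one is created from the training split
    test_split_size: 0.2
    valid_split_size: 0.1         # valid_split_name defaults to None, so this split is also created from train
    random_seed: 42
    batch_size: 32
```

Arguments that are omitted keep the defaults listed above.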

modelgenerator.data.ClassDiffusionDataModule

Bases: SequenceClassificationDataModule

Data module for conditional (or class-filtered) diffusion, and applying discrete diffusion noising.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| `config_name` | `Optional` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `None` |
| `x_col` | `str` | The name of the column(s) containing the sequences. | `'sequences'` |
| `y_col` | `Union` | The name of the column(s) containing the labels. | `'labels'` |
| `rename_cols` | `dict[str, str] \| None` | A dictionary mapping the original column names to the new column names. | `None` |
| `timesteps_per_sample` | `int` | The number of timesteps per sample. | `10` |
| `randomize_targets` | `bool` | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | `False` |
| `class_filter` | `Union` | Filter the dataset to only include samples with the specified class(es). | `None` |
| `generate_uid` | `bool` | Whether to generate a unique ID for each sample. | `False` |
| `train_split_name` | `Optional` | The name of the training split. | `'train'` |
| `test_split_name` | `Optional` | The name of the test split. Also used for mgen predict. | `'test'` |
| `valid_split_name` | `Optional` | The name of the validation split. | `None` |
| `train_split_files` | `Union` | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | `None` |
| `test_split_files` | `Union` | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict. | `None` |
| `valid_split_files` | `Union` | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0.2` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0.1` |
| `random_seed` | `int` | The random seed to use for splitting the data. | `42` |
| `extra_reader_kwargs` | `Optional` | Extra kwargs for dataset readers. | `None` |
| `batch_size` | `int` | The batch size. | `128` |
| `shuffle` | `bool` | Whether to shuffle the data. | `True` |
| `sampler` | `Optional` | The sampler to use. | `None` |
| `num_workers` | `int` | The number of workers to use for data loading. | `0` |
| `collate_fn` | `Optional` | The function to use for collating data. | `None` |
| `pin_memory` | `bool` | Whether to pin memory. | `True` |
| `persistent_workers` | `bool` | Whether to use persistent workers. | `False` |
| `cv_num_folds` | `int` | The number of cross-validation folds. Cross-validation is disabled when <= 1. | `1` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_replace_val_fold_as_test_fold` | `bool` | Replace the validation fold with the test fold. Only used when cv_enable_val_fold is False. | `False` |
| `cv_fold_id_col` | `Optional` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `None` |
| `cv_val_offset` | `int` | The offset applied to cv_test_fold_id to determine val_fold_id. | `1` |
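
A minimal sketch of a class-filtered configuration for this module follows. The `path` and the `class_filter` value are hypothetical: the label `1` only stands in for whatever class identifiers the dataset actually uses, and the full set of types accepted by `class_filter` is not shown in this reference.

```yaml
data:
  class_path: modelgenerator.data.ClassDiffusionDataModule
  init_args:
    path: path/to/my_dataset      # hypothetical placeholder
    x_col: sequences              # column with the sequences
    y_col: labels                 # column with the class labels
    class_filter: 1               # hypothetical label, keeps only samples of this class
    timesteps_per_sample: 10
    randomize_targets: false
    batch_size: 32
```

Leaving `class_filter` at its default of None applies no class filtering.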

modelgenerator.data.ConditionalDiffusionDataModule

Bases: SequenceRegressionDataModule

Data module for conditional diffusion with a continuous condition, and applying discrete diffusion noising.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| `config_name` | `Optional` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `None` |
| `x_col` | `str` | The name of the column(s) containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'labels'` |
| `rename_cols` | `dict[str, str] \| None` | A dictionary mapping the original column names to the new column names. | `None` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `generate_uid` | `bool` | Whether to generate a unique ID for each sample. | `False` |
| `timesteps_per_sample` | `int` | The number of timesteps per sample. | `10` |
| `randomize_targets` | `bool` | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | `False` |
| `train_split_name` | `Optional` | The name of the training split. | `'train'` |
| `test_split_name` | `Optional` | The name of the test split. Also used for mgen predict. | `'test'` |
| `valid_split_name` | `Optional` | The name of the validation split. | `None` |
| `train_split_files` | `Union` | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | `None` |
| `test_split_files` | `Union` | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict. | `None` |
| `valid_split_files` | `Union` | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0.2` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0.1` |
| `random_seed` | `int` | The random seed to use for splitting the data. | `42` |
| `extra_reader_kwargs` | `Optional` | Extra kwargs for dataset readers. | `None` |
| `batch_size` | `int` | The batch size. | `128` |
| `shuffle` | `bool` | Whether to shuffle the data. | `True` |
| `sampler` | `Optional` | The sampler to use. | `None` |
| `num_workers` | `int` | The number of workers to use for data loading. | `0` |
| `collate_fn` | `Optional` | The function to use for collating data. | `None` |
| `pin_memory` | `bool` | Whether to pin memory. | `True` |
| `persistent_workers` | `bool` | Whether to use persistent workers. | `False` |
| `cv_num_folds` | `int` | The number of cross-validation folds. Cross-validation is disabled when <= 1. | `1` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_replace_val_fold_as_test_fold` | `bool` | Replace the validation fold with the test fold. Only used when cv_enable_val_fold is False. | `False` |
| `cv_fold_id_col` | `Optional` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `None` |
| `cv_val_offset` | `int` | The offset applied to cv_test_fold_id to determine val_fold_id. | `1` |
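
The cross-validation arguments can be combined with the conditional diffusion settings roughly as sketched below. This is an illustrative fragment under stated assumptions: the `path` is a hypothetical placeholder and the fold count is an arbitrary example. Because `cv_fold_id_col` is left at None, the folds are assigned automatically, as described for that argument.

```yaml
data:
  class_path: modelgenerator.data.ConditionalDiffusionDataModule
  init_args:
    path: path/to/my_dataset      # hypothetical placeholder
    x_col: sequences              # column with the sequences
    y_col: labels                 # column with the continuous condition
    normalize: true               # normalize the labels
    timesteps_per_sample: 10
    cv_num_folds: 5               # values <= 1 would disable cross-validation
    cv_test_fold_id: 0            # fold held out for evaluation
    cv_enable_val_fold: true      # also reserve a validation fold
    cv_val_offset: 1              # offset from cv_test_fold_id used to pick the validation fold
    random_seed: 42
```

Changing `cv_test_fold_id` across runs while keeping the other arguments fixed is one way to evaluate each fold in turn.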

modelgenerator.data.MLMDataModule

Bases: SequenceClassificationDataModule

Data module for continuing pretraining on a masked language modeling task.

Note

Each sample includes a single sequence under key 'sequences' and a single target sequence under key 'target_sequences'

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| `config_name` | `Optional` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `None` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `Union` | The name of the column(s) containing the labels. | `'labels'` |
| `masking_rate` | `float` | The masking rate. | `0.15` |
| `rename_cols` | `dict[str, str] \| None` | A dictionary mapping the original column names to the new column names. | `None` |
| `class_filter` | `Union` | Filter the dataset to only include samples with the specified class(es). | `None` |
| `generate_uid` | `bool` | Whether to generate a unique ID for each sample. | `False` |
| `train_split_name` | `Optional` | The name of the training split. | `'train'` |
| `test_split_name` | `Optional` | The name of the test split. Also used for mgen predict. | `'test'` |
| `valid_split_name` | `Optional` | The name of the validation split. | `None` |
| `train_split_files` | `Union` | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | `None` |
| `test_split_files` | `Union` | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict. | `None` |
| `valid_split_files` | `Union` | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0.2` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0.1` |
| `random_seed` | `int` | The random seed to use for splitting the data. | `42` |
| `extra_reader_kwargs` | `Optional` | Extra kwargs for dataset readers. | `None` |
| `batch_size` | `int` | The batch size. | `128` |
| `shuffle` | `bool` | Whether to shuffle the data. | `True` |
| `sampler` | `Optional` | The sampler to use. | `None` |
| `num_workers` | `int` | The number of workers to use for data loading. | `0` |
| `collate_fn` | `Optional` | The function to use for collating data. | `None` |
| `pin_memory` | `bool` | Whether to pin memory. | `True` |
| `persistent_workers` | `bool` | Whether to use persistent workers. | `False` |
| `cv_num_folds` | `int` | The number of cross-validation folds. Cross-validation is disabled when <= 1. | `1` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_replace_val_fold_as_test_fold` | `bool` | Replace the validation fold with the test fold. Only used when cv_enable_val_fold is False. | `False` |
| `cv_fold_id_col` | `Optional` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `None` |
| `cv_val_offset` | `int` | The offset applied to cv_test_fold_id to determine val_fold_id. | `1` |
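
For continued pretraining from local files, a configuration might look like the sketch below. The folder and file names are hypothetical placeholders, and the example assumes the files contain a column named `sequences`. The "train" and "test" splits built from the listed files are picked up because `train_split_name` and `test_split_name` default to those names.

```yaml
data:
  class_path: modelgenerator.data.MLMDataModule
  init_args:
    path: path/to/data_folder     # hypothetical local data folder
    train_split_files:
    - train.tsv                   # hypothetical file, becomes the "train" split
    test_split_files:
    - test.tsv                    # hypothetical file, becomes the "test" split
    x_col: sequences              # column with the sequences to mask
    masking_rate: 0.15            # same as the default, shown for visibility
    valid_split_size: 0.1         # valid_split_name defaults to None, so validation is split from train
    batch_size: 32
```

Arguments that are omitted keep the defaults listed above.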