Data
Data modules specify data sources, as well as data loading and preprocessing for use with Tasks.
They provide a simple interface for swapping data sources and re-using datasets for new workflows without any code changes, enabling rapid and reproducible experimentation.
They are specified with the --data argument in the CLI or in the data section of a configuration file.
Data modules can automatically load common data sources (json, tsv, txt, HuggingFace) and uncommon ones (h5ad, TileDB).
They transform, split, and sample these sources for training with mgen fit, evaluation with mgen test/validate, and inference with mgen predict.
This reference overviews the available no-code data modules. If you would like to develop new datasets, see Experiment Design.
data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
      - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    train_split_name: train
    random_seed: 42
    batch_size: 32
    cv_num_folds: 5
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_fold_id_col: fold_id
model:
  ...
trainer:
  ...
Note: Data modules are designed for use with a specific task, indicated in the class name.
DNA
modelgenerator.data.NTClassification
Bases: SequenceClassificationDataModule
Nucleotide Transformer benchmarks from InstaDeep.
Note
- Manuscript: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
- Data Card: InstaDeepAI/nucleotide_transformer_downstream_tasks
- Configs:
  promoter_all, promoter_tata, promoter_no_tata, enhancers, enhancers_types, splice_sites_all, splice_sites_acceptor, splice_sites_donor, H3, H4, H3K9ac, H3K14ac, H4ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K79me3
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'InstaDeepAI/nucleotide_transformer_downstream_tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'enhancers' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequence' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'sequence': 'sequences'} |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
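As a minimal sketch, assuming the same data / class_path / init_args layout as the configuration example at the top of this page, the fragment below selects a different Nucleotide Transformer benchmark through config_name. The chosen values are illustrative only, and the model and trainer sections are omitted.

```yaml
data:
  class_path: modelgenerator.data.NTClassification
  init_args:
    config_name: promoter_all  # any name from the Configs list for this module
    batch_size: 64             # illustrative; the documented default is 128
```

Parameters that are not listed keep the defaults from the table.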
modelgenerator.data.GUEClassification
Bases: SequenceClassificationDataModule
Genome Understanding Evaluation benchmarks for DNABERT-2 from the Liu Lab at Northwestern.
Note
- Manuscript: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
- Data Card: leannmlindsey/GUE
- Configs:
  emp_H3, emp_H3K14ac, emp_H3K36me3, emp_H3K4me1, emp_H3K4me2, emp_H3K4me3, emp_H3K79me3, emp_H3K9ac, emp_H4, emp_H4ac, human_tf_0, human_tf_1, human_tf_2, human_tf_3, human_tf_4, mouse_0, mouse_1, mouse_2, mouse_3, mouse_4, prom_300_all, prom_300_notata, prom_300_tata, prom_core_all, prom_core_notata, prom_core_tata, splice_reconstructed, virus_covid, virus_species_40, fungi_species_20, EPI_K562, EPI_HeLa-S3, EPI_NHEK, EPI_IMR90, EPI_HUVEC
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'leannmlindsey/GUE' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'emp_H3' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequence' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'sequence': 'sequences'} |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
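The split parameters can be combined to hold out validation data when a dataset has no named validation split. The sketch below assumes that valid_split_size is read as a fraction of the training split, which the table suggests but does not state outright. The config name is just one entry from the Configs list.

```yaml
data:
  class_path: modelgenerator.data.GUEClassification
  init_args:
    config_name: prom_core_all
    valid_split_name: null   # no named validation split (the documented default)
    valid_split_size: 0.1    # assumed to be a fraction of the training split
    random_seed: 42          # seed used when creating the split
```

Keeping random_seed fixed should make the held-out subset reproducible across runs.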
modelgenerator.data.ClinvarRetrieve
Bases: ZeroshotClassificationRetrieveDataModule
ClinVar dataset for genomic variant effect prediction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | None |
| test_split_files | List | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | ['ClinVar_Processed.tsv'] |
| reference_file | str | The file path to the reference file for retrieving sequences. | 'hg38.ml.fa' |
| method | str | Method mode used to compute metrics. | 'Distance' |
| window | int | The number of tokens taken on either side of the mutation site. The processed sequence length is | 512 |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| index_cols | List | The list of column names containing the index for sequence retrieval. | ['chrom', 'start', 'end', 'ref', 'mutate'] |
| y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
| train_split_name | Optional | The name of the training split. | 'train' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
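Because path has no default for this module, a configuration has to supply it. The sketch below uses a placeholder folder, which is hypothetical, and otherwise repeats the documented defaults to show where each retrieval parameter goes. Whether reference_file is resolved relative to path is not stated here, so treat that as an open assumption.

```yaml
data:
  class_path: modelgenerator.data.ClinvarRetrieve
  init_args:
    path: /path/to/clinvar        # hypothetical local folder; replace with your own
    test_split_files:
      - ClinVar_Processed.tsv     # documented default
    reference_file: hg38.ml.fa    # documented default
    method: Distance              # documented default
    window: 512                   # documented default
```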
modelgenerator.data.PromoterExpressionRegression
Bases: SequenceRegressionDataModule
Gene expression prediction from promoter sequences from the Regev Lab at the Broad Institute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/100M-random-promoters' |
| x_col | Union | The name of column(s) containing the sequences. | 'sequence' |
| y_col | Union | The name of column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'sequence': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
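The split_files parameters only take effect when the matching split_name refers to them. As a sketch of that pairing, the fragment below points the module at a local folder instead of the default dataset. The folder and file names are hypothetical, and the files are assumed to contain the default sequence and label columns.

```yaml
data:
  class_path: modelgenerator.data.PromoterExpressionRegression
  init_args:
    path: /path/to/promoter_data  # hypothetical local folder
    train_split_files:
      - my_train.tsv              # hypothetical file; becomes the split named "train"
    test_split_files:
      - my_test.tsv               # hypothetical file; becomes the split named "test"
    train_split_name: train       # refers to the split built from train_split_files
    test_split_name: test         # refers to the split built from test_split_files
    valid_split_name: null        # no named validation split
    valid_split_size: 0.1         # validation data is taken from the training split
```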
modelgenerator.data.PromoterExpressionGeneration
Bases: ConditionalDiffusionDataModule
Promoter generation from gene expression data from the Regev Lab at the Broad Institute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/100M-random-promoters' |
| x_col | Union | The name of column(s) containing the sequences. | 'sequence' |
| y_col | Union | The name of column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'sequence': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| timesteps_per_sample | int | The number of timesteps per sample. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
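The two diffusion-specific parameters are set in init_args like any other. A minimal sketch follows, with illustrative values and the model and trainer sections omitted.

```yaml
data:
  class_path: modelgenerator.data.PromoterExpressionGeneration
  init_args:
    timesteps_per_sample: 10   # documented default, shown for clarity
    randomize_targets: true    # opt in to the experimental option; the default is false
    batch_size: 32             # illustrative; the documented default is 128
```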
modelgenerator.data.DependencyMappingDataModule
Bases: SequencesDataModule
Data module for doing dependency mapping via in silico mutagenesis on a dataset of sequences. Only uses the test set.
Note
Each sample includes a single sequence under key 'sequences' and optionally an 'ids' to track outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | required |
| vocab_file | str | The path to the file with the vocabulary to mutate. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
| id_col | str | The name of the column containing the ids. Defaults to "id". | 'id' |
| train_split_name | Optional | The name of the training split. | 'train' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
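This module has two required arguments, path and vocab_file, so a configuration must set both. The sketch below uses placeholder paths and a placeholder file name, all hypothetical, and only supplies a test split because the module only uses the test set.

```yaml
data:
  class_path: modelgenerator.data.DependencyMappingDataModule
  init_args:
    path: /path/to/sequences        # hypothetical local folder (required)
    vocab_file: /path/to/vocab.txt  # hypothetical vocabulary file (required)
    test_split_files:
      - sequences.tsv               # hypothetical file; becomes the split named "test"
    test_split_name: test           # documented default
    x_col: sequence                 # documented default
    id_col: id                      # documented default
```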
RNA
modelgenerator.data.TranslationEfficiency
Bases: SequenceRegressionDataModule
Translation efficiency prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  translation_efficiency_Muscle, translation_efficiency_HEK, translation_efficiency_pc3
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'translation_efficiency_Muscle' |
| x_col | | The name of column(s) containing the sequences. | 'sequences' |
| y_col | | The name of column(s) containing the labels. | 'labels' |
| normalize | bool | Whether to normalize the labels. | True |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 10 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
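This module defaults to cross-validation over a pre-split fold column. The sketch below selects one held-out fold, and the fold id 3 is an arbitrary illustrative choice. It assumes fold ids in the dataset are numbered so that 3 is valid for a 10-fold split.

```yaml
data:
  class_path: modelgenerator.data.TranslationEfficiency
  init_args:
    config_name: translation_efficiency_HEK
    cv_num_folds: 10           # documented default for this module
    cv_test_fold_id: 3         # illustrative; fold held out for evaluation
    cv_enable_val_fold: true   # also reserve a validation fold
    cv_fold_id_col: fold_id    # column holding the pre-assigned fold ids
```

A full cross-validation would presumably repeat the run once per value of cv_test_fold_id, changing nothing else in the config.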
modelgenerator.data.ExpressionLevel
Bases: SequenceRegressionDataModule
Expression level prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  expression_Muscle, expression_HEK, expression_pc3
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'expression_Muscle' |
| x_col | Union | The name of column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of column(s) containing the labels. | 'labels' |
| normalize | bool | Whether to normalize the labels. | True |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 10 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.TranscriptAbundance
Bases: SequenceRegressionDataModule
Transcript abundance prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  transcript_abundance_athaliana, transcript_abundance_dmelanogaster, transcript_abundance_ecoli, transcript_abundance_hsapiens, transcript_abundance_hvolcanii, transcript_abundance_ppastoris, transcript_abundance_scerevisiae
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'transcript_abundance_athaliana' |
| x_col | Union | The name of column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of column(s) containing the labels. | 'labels' |
| normalize | bool | Whether to normalize the labels. | True |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 5 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
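The table notes that cv_replace_val_fold_as_test_fold is only used when cv_enable_val_fold is False. The sketch below shows that combination. The exact resulting behavior is taken from the parameter descriptions rather than verified, and the config name is one illustrative entry from the Configs list.

```yaml
data:
  class_path: modelgenerator.data.TranscriptAbundance
  init_args:
    config_name: transcript_abundance_hsapiens
    cv_num_folds: 5                          # documented default for this module
    cv_test_fold_id: 0                       # fold held out for evaluation
    cv_enable_val_fold: false                # do not reserve a separate validation fold
    cv_replace_val_fold_as_test_fold: true   # per the table: validation uses the test fold
```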
modelgenerator.data.ProteinAbundance
Bases: SequenceRegressionDataModule
Protein abundance prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  protein_abundance_athaliana, protein_abundance_dmelanogaster, protein_abundance_ecoli, protein_abundance_hsapiens, protein_abundance_scerevisiae
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'protein_abundance_athaliana' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| normalize | bool | Whether to normalize the labels. | True |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 5 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
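As a minimal sketch, the data section below selects the human configuration and holds out a different fold for testing. Every key comes from the ProteinAbundance parameter table; the chosen values are illustrative assumptions, not recommended settings.

```yaml
data:
  class_path: modelgenerator.data.ProteinAbundance
  init_args:
    config_name: protein_abundance_hsapiens  # one of the listed configs
    cv_num_folds: 5
    cv_test_fold_id: 1        # hold out fold 1 for testing (illustrative choice)
    cv_enable_val_fold: true  # reserve a separate fold for validation
    cv_val_offset: 1          # validation fold is derived from cv_test_fold_id and this offset
    cv_fold_id_col: fold_id   # use the fold ids already present in the dataset
```

Repeating the run with cv_test_fold_id set to each fold id in turn would presumably cover a full cross-validation sweep without touching any other setting.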
modelgenerator.data.NcrnaFamilyClassification
Bases: SequenceClassificationDataModule
Non-coding RNA family classification benchmarks from DPTechnology.
Note
- Manuscript: UNI-RNA: UNIVERSAL PRE-TRAINED MODELS REVOLUTIONIZE RNA RESEARCH
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  - ncrna_family_bnoise0
  - ncrna_family_bnoise200
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'ncrna_family_bnoise0' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| train_split_name | str | The name of the training split. | 'train' |
| valid_split_name | str | The name of the validation split. | 'validation' |
| test_split_name | str | The name of the test split. Also used for | 'test' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
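The two listed configurations can be swapped by changing a single key. The sketch below assumes the rest of the configuration file (model, trainer) stays the same; the batch size is an illustrative value.

```yaml
data:
  class_path: modelgenerator.data.NcrnaFamilyClassification
  init_args:
    config_name: ncrna_family_bnoise200  # default is ncrna_family_bnoise0
    batch_size: 64                       # illustrative; default is 128
```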
modelgenerator.data.SpliceSitePrediction
Bases: SequenceClassificationDataModule
Splice site prediction benchmarks from the Thompson Lab at University of Strasbourg.
Note
- Manuscript: Spliceator: multi-species splice site prediction using convolutional neural networks
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
  - splice_site_acceptor
  - splice_site_donor
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'splice_site_acceptor' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| train_split_name | str | The name of the training split. | 'train' |
| valid_split_name | str | The name of the validation split. | 'validation' |
| test_split_name | str | The name of the test split. Also used for | 'test_danio' |
| batch_size | int | The batch size. | 16 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
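A hedged example of switching from the acceptor to the donor benchmark is sketched below. The split names are the documented defaults; which other test splits exist depends on the data card and is not assumed here.

```yaml
data:
  class_path: modelgenerator.data.SpliceSitePrediction
  init_args:
    config_name: splice_site_donor  # default is splice_site_acceptor
    train_split_name: train
    valid_split_name: validation
    test_split_name: test_danio     # documented default test split
    batch_size: 16                  # documented default for this module
```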
modelgenerator.data.ModificationSitePrediction
Bases: SequenceClassificationDataModule
Modification site prediction benchmarks from the Meng Lab at the University of Liverpool.
Note
- Manuscript: Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
modification_site
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'modification_site' |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | List | The name of the column(s) containing the labels. | ['labels_0', 'labels_1', 'labels_2', 'labels_3', 'labels_4', 'labels_5', 'labels_6', 'labels_7', 'labels_8', 'labels_9', 'labels_10', 'labels_11'] |
| train_split_name | str | The name of the training split. | 'train' |
| valid_split_name | str | The name of the validation split. | 'validation' |
| test_split_name | str | The name of the test split. Also used for | 'test' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.RNAMeanRibosomeLoadDataModule
Bases: SequenceRegressionDataModule
Data module for the mean ribosome load dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'genbio-ai/rna-downstream-tasks' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | 'mean_ribosome_load' |
| train_split_name | str | The name of the training split. | 'train' |
| valid_split_name | str | The name of the validation split. | 'validation' |
| test_split_name | str | The name of the test split. Also used for | 'test' |
| x_col | str | The name of the column(s) containing the sequences. | 'utr' |
| y_col | str | The name of the column(s) containing the labels. | 'rl' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'utr': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | False |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
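The column handling of this module can be made explicit in a configuration. The sketch below restates the documented defaults for x_col, y_col, and rename_cols, and turns on label normalization as an illustrative change; whether normalization helps for a given task is not claimed here.

```yaml
data:
  class_path: modelgenerator.data.RNAMeanRibosomeLoadDataModule
  init_args:
    config_name: mean_ribosome_load
    x_col: utr            # original sequence column in the dataset
    y_col: rl             # ribosome load label column
    rename_cols:
      utr: sequences      # documented default mapping
    normalize: true       # illustrative; the default is false
```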
Protein
modelgenerator.data.ContactPredictionBinary
Bases: TokenClassificationDataModule
Protein contact prediction benchmarks from BioMap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/contact_prediction_binary' |
| pairwise | bool | Whether the labels are pairwise. | True |
| x_col | str | The name of the column containing the sequences. | 'seq' |
| y_col | str | The name of the column containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| batch_size | int | The batch size. | 1 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| extra_cols | Optional | Additional columns to include in the dataset. | None |
| max_length | Optional | The maximum length of the sequences. | None |
| truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
| collate_fn | Optional | The function to use for collating data. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
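Because pairwise labels cover every pair of residues, their size grows quadratically with sequence length, which is presumably why this module defaults to a batch size of 1. The sketch below additionally caps the sequence length; the value 512 is an illustrative assumption, and how over-length sequences are handled is not stated in this reference.

```yaml
data:
  class_path: modelgenerator.data.ContactPredictionBinary
  init_args:
    pairwise: true    # documented default for contact prediction
    batch_size: 1     # documented default
    max_length: 512   # illustrative cap; the default is no limit (None)
```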
modelgenerator.data.SspQ3
Bases: TokenClassificationDataModule
Protein secondary structure prediction benchmarks from BioMap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/ssp_q3' |
| pairwise | bool | Whether the labels are pairwise. | False |
| x_col | str | The name of the column containing the sequences. | 'seq' |
| y_col | str | The name of the column containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| batch_size | int | The batch size. | 1 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| extra_cols | Optional | Additional columns to include in the dataset. | None |
| max_length | Optional | The maximum length of the sequences. | None |
| truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
| collate_fn | Optional | The function to use for collating data. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.FoldPrediction
Bases: SequenceClassificationDataModule
Protein fold prediction benchmarks from BioMap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/fold_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
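With the defaults for this module, the test split is taken by name while no validation split is named, so a validation set is carved out of the training split according to valid_split_size. A minimal sketch that makes this explicit, with an illustrative fraction and seed:

```yaml
data:
  class_path: modelgenerator.data.FoldPrediction
  init_args:
    train_split_name: train
    test_split_name: test     # named split exists, so test_split_size is not used
    valid_split_name: null    # no named validation split
    valid_split_size: 0.15    # illustrative; hold out 15% of the training split
    random_seed: 7            # illustrative; controls how the split is drawn
```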
modelgenerator.data.LocalizationPrediction
Bases: SequenceClassificationDataModule
Protein localization prediction benchmarks from BioMap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/localization_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
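The split_files arguments let the same module read a local dataset instead of the default Hugging Face one. In the sketch below the folder and file names are hypothetical placeholders, and the files are assumed to contain the seq and label columns this module expects.

```yaml
data:
  class_path: modelgenerator.data.LocalizationPrediction
  init_args:
    path: /path/to/my_dataset    # hypothetical local data folder
    train_split_files:
      - my_train.tsv             # hypothetical file name
    test_split_files:
      - my_test.tsv              # hypothetical file name
    train_split_name: train      # refers to the split built from train_split_files
    test_split_name: test        # refers to the split built from test_split_files
    x_col: seq
    y_col: label
```

Per the parameter descriptions, the file lists are only used because the split_name arguments reference "train" and "test" by name.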
modelgenerator.data.MetalIonBinding
Bases: SequenceClassificationDataModule
Metal ion binding prediction benchmarks from BioMap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/metal_ion_binding' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
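As an illustration of how these parameters map onto a configuration file, the following is a minimal sketch of a data section for this module. It assumes the same `data` / `class_path` / `init_args` layout as the example at the top of this reference; the indentation is reconstructed, and the `batch_size` override is an arbitrary illustrative value. Every other value simply restates a default from the table.

```yaml
data:
  class_path: modelgenerator.data.MetalIonBinding
  init_args:
    # Defaults restated from the parameter table, shown only for clarity.
    path: proteinglm/metal_ion_binding
    x_col: seq
    y_col: label
    rename_cols:
      seq: sequences
    # Illustrative override; the documented default is 128.
    batch_size: 32
```

Any parameter left out of `init_args` should fall back to the default listed in the table.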
modelgenerator.data.SolubilityPrediction
Bases: SequenceClassificationDataModule
Protein solubility prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/solubility_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
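The split parameters interact: when `valid_split_name` is None, a validation split of size `valid_split_size` is created from the training split. A minimal sketch of that case follows, assuming the same `data` section layout as the example at the top of this reference. The values 0.2 and 7 are arbitrary illustrative choices, not recommendations.

```yaml
data:
  class_path: modelgenerator.data.SolubilityPrediction
  init_args:
    train_split_name: train
    test_split_name: test
    # No named validation split, so one is carved out of the training split.
    valid_split_name: null
    # Illustrative size; the documented default is 0.1.
    valid_split_size: 0.2
    # Illustrative seed for the split; the documented default is 42.
    random_seed: 7
```

Keeping `random_seed` fixed across runs should make the generated validation split reproducible.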
modelgenerator.data.AntibioticResistance
Bases: SequenceClassificationDataModule
Antibiotic resistance prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/antibiotic_resistance' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
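The `cv_*` parameters work together to enable cross-validation. The sketch below shows one way they might be combined, assuming the same `data` section layout as the example at the top of this reference. The choice of five folds is illustrative, and how fold ids are assigned when `cv_fold_id_col` is None is left to the library.

```yaml
data:
  class_path: modelgenerator.data.AntibioticResistance
  init_args:
    # Any value greater than 1 enables cross-validation.
    cv_num_folds: 5
    # The fold held out for evaluation in this run.
    cv_test_fold_id: 0
    # Keep a separate validation fold, offset by 1 from the test fold id.
    cv_enable_val_fold: true
    cv_val_offset: 1
    # No pre-assigned fold column, so folds are split automatically.
    cv_fold_id_col: null
    random_seed: 42
```

A full cross-validation would presumably repeat the run once per fold, changing only `cv_test_fold_id` each time.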
modelgenerator.data.CloningClf
Bases: SequenceClassificationDataModule
Cloning classification prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/cloning_clf' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
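Because `path` may also be a local data folder, the `*_split_files` parameters can build named splits from local files. The following is a rough sketch of that setup, assuming the same `data` section layout as the example at the top of this reference. The folder `./my_cloning_data` and the file names `train.tsv` and `test.tsv` are hypothetical, and the files are assumed to contain the default `seq` and `label` columns.

```yaml
data:
  class_path: modelgenerator.data.CloningClf
  init_args:
    # Hypothetical local folder; replace with your own data directory.
    path: ./my_cloning_data
    # Hypothetical file names, assumed to be located inside `path`.
    train_split_files:
      - train.tsv
    test_split_files:
      - test.tsv
    # The file-based splits are only used when referenced by these names.
    train_split_name: train
    test_split_name: test
    # No validation files, so 10% of the training split is held out instead.
    valid_split_name: null
    valid_split_size: 0.1
```

If the local files use different column names, `x_col`, `y_col`, and `rename_cols` would need to be adjusted to match.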
modelgenerator.data.MaterialProduction
Bases: SequenceClassificationDataModule
Material production prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/material_production' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.TcrPmhcAffinity
Bases: SequenceClassificationDataModule
TCR-pMHC affinity prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/tcr_pmhc_affinity' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.PeptideHlaMhcAffinity
Bases: SequenceClassificationDataModule
Peptide-HLA-MHC affinity prediction benchmarks from BioMap.
Note
- Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
- Data Card: proteinglm/peptide_HLA_MHC_affinity
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/peptide_HLA_MHC_affinity' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.TemperatureStability
Bases: SequenceClassificationDataModule
Temperature stability prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/temperature_stability' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
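The data loading parameters (`batch_size`, `shuffle`, `num_workers`, `pin_memory`, `persistent_workers`) can be tuned independently of the dataset itself. The sketch below is one possible combination, assuming the same `data` section layout as the example at the top of this reference. The specific numbers are illustrative only and depend on the available hardware.

```yaml
data:
  class_path: modelgenerator.data.TemperatureStability
  init_args:
    # Illustrative values; the documented defaults are 128 and 0.
    batch_size: 64
    num_workers: 4
    # Persistent workers only make sense when num_workers is above 0.
    persistent_workers: true
    pin_memory: true
    shuffle: true
```

With the default `num_workers: 0`, data is loaded in the main process, so `persistent_workers` is best left at its default of False in that case.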
modelgenerator.data.FluorescencePrediction
Bases: SequenceRegressionDataModule
Fluorescence prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/fluorescence_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
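Regression data modules add a few parameters that the classification modules do not have, such as `normalize` and `max_context_length`. A minimal sketch of a data section that sets them explicitly follows, assuming the same `data` section layout as the example at the top of this reference. All values except `batch_size` restate the documented defaults.

```yaml
data:
  class_path: modelgenerator.data.FluorescencePrediction
  init_args:
    # Regression-specific options, shown at their documented defaults.
    normalize: true
    max_context_length: 12800
    msa_random_seed: null
    is_rag_dataset: false
    # Illustrative override; the documented default is 128.
    batch_size: 32
```

Note that `class_filter` is not listed for this module, so it should not appear in a regression config.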
modelgenerator.data.FitnessPrediction
Bases: SequenceRegressionDataModule
Fitness prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier. | 'proteinglm/fitness_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.StabilityPrediction
Bases: SequenceRegressionDataModule
Stability prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/stability_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

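
As a minimal sketch, the data section below selects this module and overrides two of its documented defaults. It uses only parameters listed in the table above; the chosen values are illustrative assumptions, not recommendations.

```yaml
# Illustrative data section of a configuration file; values are assumptions.
data:
  class_path: modelgenerator.data.StabilityPrediction
  init_args:
    path: proteinglm/stability_prediction  # documented default dataset
    batch_size: 32                         # documented default is 128
    normalize: false                       # documented default is true; skips label normalization
```

Any parameter left out of init_args keeps the default shown in the table.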
modelgenerator.data.EnzymeCatalyticEfficiencyPrediction
Bases: SequenceRegressionDataModule
Enzyme catalytic efficiency prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/enzyme_catalytic_efficiency' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

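
The cross-validation parameters can be combined as in the following sketch, which turns on 5-fold cross-validation with automatic fold assignment. The fold count is an illustrative assumption, and how cross-validation interacts with the dataset's predefined test split is not covered here.

```yaml
# Illustrative cross-validation settings; the fold count is an assumption.
data:
  class_path: modelgenerator.data.EnzymeCatalyticEfficiencyPrediction
  init_args:
    cv_num_folds: 5           # values <= 1 disable cross-validation
    cv_test_fold_id: 0        # fold used for cross-validation evaluation
    cv_enable_val_fold: true  # also reserve a validation fold
    cv_fold_id_col: null      # None: folds are assigned automatically
    random_seed: 42           # seed used for splitting the data
```

Evaluating every fold would mean repeating the run with cv_test_fold_id set to each fold id in turn.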
modelgenerator.data.OptimalTemperaturePrediction
Bases: SequenceRegressionDataModule
Optimal temperature prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/optimal_temperature' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

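
The split-name and split-size parameters work together: a size only takes effect when the matching split name is None. The sketch below keeps the dataset's named test split and holds out part of the training split for validation; the 15% figure is an illustrative assumption.

```yaml
# Illustrative split settings; the validation fraction is an assumption.
data:
  class_path: modelgenerator.data.OptimalTemperaturePrediction
  init_args:
    train_split_name: train
    test_split_name: test    # use the named test split, so test_split_size is not applied
    valid_split_name: null   # no named validation split...
    valid_split_size: 0.15   # ...so 15% of the training split is held out instead
    random_seed: 42          # seed used for splitting the data
```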
modelgenerator.data.OptimalPhPrediction
Bases: SequenceRegressionDataModule
Optimal pH prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/optimal_ph' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

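
Because path may also point to a local data folder, the module can be reused with your own files. The sketch below is a hypothetical setup: the folder and file names are invented for illustration, and it assumes the files sit inside the folder given by path and use the same seq and label columns as the default dataset.

```yaml
# Hypothetical local dataset; the folder and file names are invented.
data:
  class_path: modelgenerator.data.OptimalPhPrediction
  init_args:
    path: /data/my_ph_dataset   # hypothetical local data folder
    train_split_files:
      - train.tsv               # hypothetical file; becomes the split called "train"
    test_split_files:
      - test.tsv                # hypothetical file; becomes the split called "test"
    train_split_name: train     # references the split built from train_split_files
    test_split_name: test       # references the split built from test_split_files
    x_col: seq                  # column holding the sequences
    y_col: label                # column holding the labels
```

The split files are only used because train_split_name and test_split_name reference the names "train" and "test".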
modelgenerator.data.DMSFitnessPrediction
Bases: SequenceRegressionDataModule
Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'genbio-ai/ProteinGYM-DMS' |
| train_split_files | list | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | ['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv'] |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 5 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | -1 |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| normalize | bool | Whether to normalize the labels. | True |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |

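
This module defaults to 5-fold cross-validation over a pre-split fold_id column, so a different held-out fold can be selected by changing a single value. The sketch below holds out fold 2; the choice of fold is an illustrative assumption. If the offset is simply added to the test fold id (an assumption about the implementation, not stated in this reference), the validation fold here would be fold 1.

```yaml
# Illustrative fold selection; only cv_test_fold_id differs from the defaults.
data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
      - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    cv_num_folds: 5
    cv_fold_id_col: fold_id  # folds come from the dataset's pre-split column
    cv_test_fold_id: 2       # hold out fold 2 instead of the default 0
    cv_val_offset: -1        # documented default for this module
```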
Structure
modelgenerator.data.ContactPredictionBinary
Bases: TokenClassificationDataModule
Protein contact prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/contact_prediction_binary' |
| pairwise | bool | Whether the labels are pairwise. | True |
| x_col | str | The name of the column containing the sequences. | 'seq' |
| y_col | str | The name of the column containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| batch_size | int | The batch size. | 1 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| extra_cols | Optional | Additional columns to include in the dataset. | None |
| max_length | Optional | The maximum length of the sequences. | None |
| truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
| collate_fn | Optional | The function to use for collating data. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

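
A minimal sketch of a token-level configuration follows. It keeps the module's documented defaults for pairwise and batch_size and sets max_length, which is otherwise None; the value 512 is an illustrative assumption rather than a recommended limit.

```yaml
# Illustrative settings; the max_length value is an assumption.
data:
  class_path: modelgenerator.data.ContactPredictionBinary
  init_args:
    pairwise: true   # documented default for this module
    batch_size: 1    # documented default for this module
    max_length: 512  # maximum length of the sequences; default is null
```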
modelgenerator.data.SspQ3
Bases: TokenClassificationDataModule
Protein secondary structure prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/ssp_q3' |
| pairwise | bool | Whether the labels are pairwise. | False |
| x_col | str | The name of the column containing the sequences. | 'seq' |
| y_col | str | The name of the column containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| batch_size | int | The batch size. | 1 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| extra_cols | Optional | Additional columns to include in the dataset. | None |
| max_length | Optional | The maximum length of the sequences. | None |
| truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
| collate_fn | Optional | The function to use for collating data. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

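
The loading parameters (num_workers, pin_memory, persistent_workers, shuffle) appear to mirror the arguments of a PyTorch DataLoader, although this reference does not say so explicitly. Under that assumption, a sketch that enables background loading might look like this; the worker count is illustrative.

```yaml
# Illustrative data loading settings; the worker count is an assumption.
data:
  class_path: modelgenerator.data.SspQ3
  init_args:
    num_workers: 4            # documented default is 0
    persistent_workers: true  # documented default is false
    pin_memory: true          # documented default
    shuffle: true             # documented default
```

In PyTorch, persistent workers require num_workers to be greater than 0, so the two settings are changed together here.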
modelgenerator.data.FoldPrediction
Bases: SequenceClassificationDataModule
Protein fold prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/fold_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

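
The class_filter parameter restricts the dataset to selected classes. The sketch below is hypothetical: it assumes the labels are integer class ids and that a list is accepted, neither of which is confirmed in this reference, so check the label format of your dataset before using it.

```yaml
# Hypothetical class filter; the class ids and list form are assumptions.
data:
  class_path: modelgenerator.data.FoldPrediction
  init_args:
    class_filter: [0, 1]  # keep only samples of these classes; default null keeps all
```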
modelgenerator.data.FluorescencePrediction
Bases: SequenceRegressionDataModule
Fluorescence prediction benchmarks from BioMap.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/fluorescence_prediction' |
| x_col | Union | The name of the column(s) containing the sequences. | 'seq' |
| y_col | Union | The name of the column(s) containing the labels. | 'label' |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'seq': 'sequences'} |
| normalize | bool | Whether to normalize the labels. | True |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |

modelgenerator.data.DMSFitnessPrediction
Bases: SequenceRegressionDataModule
Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'genbio-ai/ProteinGYM-DMS' |
| train_split_files | list | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | ['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv'] |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 5 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | -1 |
| valid_split_name | str | The name of the validation split. | None |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
| test_split_name | str | The name of the test split. Also used for | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
| max_context_length | int | Maximum context length for the input sequences. | 12800 |
| msa_random_seed | Optional | Random seed for MSA generation. | None |
| is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| normalize | bool | Whether to normalize the labels. | True |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |

modelgenerator.data.StructureTokenDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Test-only data module for structure token predictors.
This data module is designed for datasets that use amino acid sequences as input and structure tokens as labels.
Note
This module only supports testing and ignores training and validation splits. It assumes test split files contain sequences and optionally their structural token labels. If structural token labels are not provided, dummy labels are created.
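The fallback described in the note can be pictured with a small sketch. It is not the module's actual implementation: the helper `add_dummy_labels`, the field names `sequences` and `labels`, and the placeholder token `0` are all assumptions made for illustration.

```python
from typing import Any


def add_dummy_labels(
    records: list[dict[str, Any]], dummy_token: int = 0
) -> list[dict[str, Any]]:
    """Return copies of the records, adding placeholder labels where missing."""
    completed = []
    for record in records:
        record = dict(record)  # copy so the caller's data is left untouched
        if record.get("labels") is None:
            # Assumed convention: one placeholder token per residue.
            record["labels"] = [dummy_token] * len(record["sequences"])
        completed.append(record)
    return completed


# Toy test records: the second one has no structure token labels.
test_records = [
    {"sequences": "MKT", "labels": [5, 9, 2]},
    {"sequences": "GAVL"},
]
completed = add_dummy_labels(test_records)
```

Records that already carry labels pass through unchanged, so the same test file can mix labeled and unlabeled sequences.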
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| test_split_files | Optional | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| batch_size | int | The batch size. | 1 |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
Cell
modelgenerator.data.CellClassificationDataModule
Bases: DataInterface
Data module for cell classification.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| backbone_class_path | Optional | Class path of the backbone model. | None |
| filter_columns | Optional | The columns of | None |
| rename_columns | Optional | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
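The relationship between filter_columns and rename_columns can be sketched on a plain dictionary of per-cell metadata. This is an illustration only: the helper `select_and_rename`, the toy metadata keys, and the assumption that rename_columns is a list aligned position by position with filter_columns are not confirmed by this reference.

```python
from typing import Optional


def select_and_rename(
    row: dict,
    filter_columns: Optional[list[str]] = None,
    rename_columns: Optional[list[str]] = None,
) -> dict:
    """Keep only filter_columns and optionally rename them (assumed semantics)."""
    if filter_columns is None:
        # Nothing is filtered, so rename_columns has no effect.
        return dict(row)
    kept = {name: row[name] for name in filter_columns}
    if rename_columns is None:
        return kept
    if len(rename_columns) != len(filter_columns):
        raise ValueError("rename_columns must match filter_columns in length")
    return {new: kept[old] for old, new in zip(filter_columns, rename_columns)}


# Toy metadata for one cell.
row = {"cell_type": "T cell", "donor": "d1", "batch": 3}
unchanged = select_and_rename(row, None, ["labels"])
renamed = select_and_rename(row, ["cell_type"], ["labels"])
```

The first call shows the documented no-op: without filter_columns, the renaming list is ignored.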
modelgenerator.data.CellClassificationLargeDataModule
Bases: DataInterface
Data module for cell classification. This class handles large datasets and is implemented on top of TileDB.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the TileDB dataset folder. | required |
| train_split_subfolder | str | Subfolder name for the training split. | required |
| valid_split_subfolder | str | Subfolder name for the validation split. | required |
| test_split_subfolder | str | Subfolder name for the test split. | required |
| backbone_class_path | Optional | Class path of the backbone model. | None |
| layer_name | str | Name of the layer in the TileDB dataset. | 'data' |
| obs_column_name | str | Name of the column in | 'cell_type' |
| measurement_name | str | Name of the measurement in the TileDB dataset. | 'RNA' |
| axis_query_value_filter | Optional | Optional filter for the axis query. | None |
| prefetch_factor | int | Number of batches to prefetch. | 16 |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.ClockDataModule
Bases: DataInterface
Data module for transcriptomic clock tasks.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| split_column | str | The column of | required |
| label_scaling | Optional | The type of label scaling to apply. | 'z_scaling' |
| backbone_class_path | Optional | Class path of the backbone model. | None |
| filter_columns | Optional | The columns of | None |
| rename_columns | Optional | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
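The default label_scaling value, 'z_scaling', most plausibly refers to z-score standardization of the regression labels. The sketch below shows that transformation and its inverse under that assumption. The helper names, the use of the population standard deviation, and the toy age values are illustrative choices, not the module's confirmed behavior.

```python
from statistics import mean, pstdev


def z_scale(labels: list[float]) -> tuple[list[float], float, float]:
    """Standardize labels to zero mean and unit variance."""
    mu = mean(labels)
    sigma = pstdev(labels)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("labels are constant; z-scaling is undefined")
    return [(y - mu) / sigma for y in labels], mu, sigma


def invert_z_scale(scaled: list[float], mu: float, sigma: float) -> list[float]:
    """Map scaled predictions back to the original label units."""
    return [z * sigma + mu for z in scaled]


# Toy labels, e.g. donor ages in years.
ages = [20.0, 40.0, 60.0]
scaled, mu, sigma = z_scale(ages)
restored = invert_z_scale(scaled, mu, sigma)
```

Keeping `mu` and `sigma` alongside the scaled values is what allows predictions to be reported in the original units afterwards.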
modelgenerator.data.PertClassificationDataModule
Bases: DataInterface
Data module for perturbation classification.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| pert_column | str | Column of | required |
| cell_line_column | str | Column of | required |
| cell_line | str | Name of cell line to consider. | required |
| split_seed | int | Seed for train/val/test splits. | 1234 |
| train_frac | float | Fraction of examples to assign to train set. | 0.7 |
| val_frac | float | Fraction of examples to assign to val set. | 0.15 |
| test_frac | float | Fraction of examples to assign to test set. | 0.15 |
| backbone_class_path | Optional | Class path of the backbone model. | None |
| filter_columns | Optional | The columns of | None |
| rename_columns | Optional | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
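A seeded three-way split driven by train_frac, val_frac, and test_frac might look like the sketch below. The helper `split_indices` is hypothetical, and the shuffle-then-slice strategy is an assumption. The module may, for instance, stratify by perturbation instead.

```python
import random


def split_indices(
    n: int,
    train_frac: float = 0.7,
    val_frac: float = 0.15,
    test_frac: float = 0.15,
    split_seed: int = 1234,
) -> dict[str, list[int]]:
    """Shuffle example indices with a fixed seed and slice them by fraction."""
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-8:
        raise ValueError("train_frac, val_frac and test_frac must sum to 1")
    indices = list(range(n))
    random.Random(split_seed).shuffle(indices)  # local RNG, reproducible
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder goes to test
    }


splits = split_indices(100)
```

Because the random generator is seeded locally, repeating the call with the same split_seed reproduces the same partition.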
Tissue
modelgenerator.data.CellWithNeighborDataModule
Bases: DataInterface
Data module for cell classification with neighbors for AIDO.Tissue.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| filter_columns | Optional | The columns of | None |
| rename_columns | Optional | Optional list of columns to rename. | None |
| use_random_neighbor | bool | Whether to use random neighbors. | False |
| copy_center_as_neighbor | bool | Whether to copy center as a neighbor. | False |
| neighbor_num | int | Number of neighbors to consider. | 10 |
| generate_uid | bool | Whether to generate a unique identifier. | False |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
Multimodal
modelgenerator.data.IsoformExpression
Bases: SequenceRegressionDataModule
Isoform expression prediction benchmarks from the
Note
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'genbio-ai/transcript_isoform_expression_prediction' |
| config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | Union | The name of the column(s) containing the sequences. | ['dna_seq', 'rna_seq', 'protein_seq'] |
| rename_cols | dict | A dictionary mapping the original column names to the new column names. | {'dna_seq': 'dna_sequences', 'rna_seq': 'rna_sequences', 'protein_seq': 'protein_sequences'} |
| valid_split_name | | The name of the validation split. | 'valid' |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | 'train_*.tsv' |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | 'test.tsv' |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | 'validation.tsv' |
| normalize | bool | Whether to normalize the labels. | True |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
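The default rename_cols mapping shown in the table can be read as a key rename applied to every row. The sketch below applies it to a toy record. The helper `rename_row` and the example sequence and label values are made up for illustration, and the real module may rename at the dataset level rather than row by row.

```python
def rename_row(row: dict, rename_cols: dict[str, str]) -> dict:
    """Rename keys found in rename_cols; leave every other key as-is."""
    return {rename_cols.get(key, key): value for key, value in row.items()}


# Default mapping documented for IsoformExpression.
rename_cols = {
    "dna_seq": "dna_sequences",
    "rna_seq": "rna_sequences",
    "protein_seq": "protein_sequences",
}

# Toy row; the values are placeholders, not real data.
row = {"dna_seq": "ACGT", "rna_seq": "ACGU", "protein_seq": "MK", "labels": 1.5}
renamed = rename_row(row, rename_cols)
```

Columns that do not appear in the mapping, such as `labels` here, keep their original names.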
Base Classes
modelgenerator.data.DataInterface
Bases: LightningDataModule, KFoldMixin
Base class for all data modules in this project. Handles the boilerplate of setting up data loaders.
Note
Subclasses must implement the setup method.
All datasets should return a dictionary of data items.
To use HF loading, add the HFDatasetLoaderMixin.
For any task-specific behaviors, implement transformations using torch.utils.data.Dataset objects.
See MLM for an example.
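The convention that every dataset item is a dictionary, with task-specific transformations layered on as wrapper datasets, can be sketched without any dependencies. The class below only mimics the map-style protocol (`__len__` and `__getitem__`). In a real data module it would subclass torch.utils.data.Dataset. The class name, the uppercase transformation, and the toy records are assumptions made for this example.

```python
class UppercaseSequences:
    """Map-style dataset wrapper that transforms one field of each item.

    Stand-in for a torch.utils.data.Dataset subclass; it follows the same
    __len__/__getitem__ protocol but has no torch dependency.
    """

    def __init__(self, base):
        self.base = base  # any indexable collection of dict items

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        item = dict(self.base[idx])  # copy so the source item is not mutated
        item["sequences"] = item["sequences"].upper()
        return item  # always a dictionary of data items


raw = [
    {"sequences": "acgt", "labels": 0},
    {"sequences": "ttga", "labels": 1},
]
dataset = UppercaseSequences(raw)
first = dataset[0]
```

Because the wrapper returns dictionaries with the same keys it received, several such transformations can be stacked on top of one another.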
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
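One way to read the cross-validation parameters together is sketched below. The helper `assign_folds` is hypothetical, and the wrap-around rule `(cv_test_fold_id + cv_val_offset) % cv_num_folds` is an assumption about how val_fold_id is derived. The table only states that an offset is applied.

```python
def assign_folds(
    cv_num_folds: int,
    cv_test_fold_id: int = 0,
    cv_enable_val_fold: bool = True,
    cv_replace_val_fold_as_test_fold: bool = False,
    cv_val_offset: int = 1,
) -> dict[str, list[int]]:
    """Decide which fold ids serve as train, validation and test folds."""
    if cv_num_folds <= 1:
        raise ValueError("cross-validation is disabled when cv_num_folds <= 1")
    test_fold = cv_test_fold_id
    if cv_enable_val_fold:
        # Assumed rule: offset from the test fold, wrapping around.
        val_folds = [(cv_test_fold_id + cv_val_offset) % cv_num_folds]
    elif cv_replace_val_fold_as_test_fold:
        # No dedicated validation fold: validate on the test fold.
        val_folds = [test_fold]
    else:
        val_folds = []
    held_out = {test_fold, *val_folds}
    train_folds = [f for f in range(cv_num_folds) if f not in held_out]
    return {"train": train_folds, "val": val_folds, "test": [test_fold]}


folds = assign_folds(cv_num_folds=5)
last = assign_folds(cv_num_folds=5, cv_test_fold_id=4)
no_val = assign_folds(cv_num_folds=5, cv_enable_val_fold=False)
```

Running one experiment per value of cv_test_fold_id from 0 to cv_num_folds - 1 would then cover every fold as the test fold exactly once.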
modelgenerator.data.ColumnRetrievalDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Simple data module for retrieving and renaming columns from a dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| in_cols | List | The names of the columns to retrieve. | [] |
| out_cols | Optional | The names to use as aliases for the retrieved columns. | None |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
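The in_cols and out_cols pair can be pictured as a per-row projection with optional aliasing. The following sketch assumes that out_cols, when given, lines up position by position with in_cols, and that the original names are kept when it is None. The helper `retrieve_columns` and the toy row are hypothetical.

```python
from typing import Optional


def retrieve_columns(
    row: dict, in_cols: list[str], out_cols: Optional[list[str]] = None
) -> dict:
    """Pick in_cols from a row, exposing them under out_cols if provided."""
    names = in_cols if out_cols is None else out_cols
    if len(names) != len(in_cols):
        raise ValueError("out_cols must have the same length as in_cols")
    return {alias: row[source] for source, alias in zip(in_cols, names)}


# Toy row with one column that is not retrieved.
row = {"seq": "MKT", "score": 0.8, "source": "toy"}
plain = retrieve_columns(row, ["seq", "score"])
aliased = retrieve_columns(row, ["seq", "score"], ["sequences", "labels"])
```

Aliasing in this way is what lets a dataset with arbitrary column names feed a task that expects fixed keys.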
modelgenerator.data.SequencesDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for loading a simple dataset of sequences.
Note
Each sample includes a single sequence under key 'sequences' and optionally an 'id' to track outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| x_col | str | The name of the column containing the sequences. | 'sequence' |
| id_col | str | The name of the column containing the ids. | 'id' |
| train_split_name | Optional | The name of the training split. | 'train' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds; disables CV when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
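To make the sample layout from the note concrete, the sketch below maps a raw row to a sample keyed by 'sequences', carrying an 'id' only when the id column is present. The helper `to_sample` and the toy rows are illustrative assumptions. Only the output keys and the x_col and id_col defaults come from this reference.

```python
def to_sample(row: dict, x_col: str = "sequence", id_col: str = "id") -> dict:
    """Build a sample with key 'sequences' and an optional 'id'."""
    sample = {"sequences": row[x_col]}
    if id_col in row:
        # The id is optional and only used to track outputs.
        sample["id"] = row[id_col]
    return sample


with_id = to_sample({"sequence": "ACGT", "id": "seq_0"})
without_id = to_sample({"sequence": "TTGA"})
custom = to_sample({"dna": "GGCC", "name": "x1"}, x_col="dna", id_col="name")
```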
modelgenerator.data.SequenceClassificationDataModule
Bases: ClassificationDataModule, HFDatasetLoaderMixin
Data module for Hugging Face sequence classification datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. |
required |
x_col
|
Union
|
The name of the column(s) containing the sequences. |
'sequences'
|
y_col
|
Union
|
The name of the column(s) containing the labels. |
'labels'
|
rename_cols
|
dict[str, str] | None
|
A dictionary mapping the original column names to the new column names. |
None
|
config_name
|
Optional
|
The name of the HF dataset configuration. Affects how the dataset is loaded. |
None
|
class_filter
|
Union
|
Filter the dataset to only include samples with the specified class(es). |
None
|
generate_uid
|
bool
|
Whether to generate a unique ID for each sample. |
False
|
train_split_name
|
Optional
|
The name of the training split. |
'train'
|
test_split_name
|
Optional
|
The name of the test split. Also used for |
'test'
|
valid_split_name
|
Optional
|
The name of the validation split. |
None
|
train_split_files
|
Union
|
Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. |
None
|
test_split_files
|
Union
|
Create a split called "test" from these files.
Not used unless referenced by the name "test" in one of the split_name arguments.
Also used for |
None
|
valid_split_files
|
Union
|
Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. |
None
|
test_split_size
|
float
|
The size of the test split. If test_split_name is None, creates a test split of this size from the training split. |
0.2
|
valid_split_size
|
float
|
The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. |
0.1
|
random_seed
|
int
|
The random seed to use for splitting the data. |
42
|
extra_reader_kwargs
|
Optional
|
Extra kwargs for dataset readers. |
None
|
batch_size
|
int
|
The batch size. |
128
|
shuffle
|
bool
|
Whether to shuffle the data. |
True
|
sampler
|
Optional
|
The sampler to use. |
None
|
num_workers
|
int
|
The number of workers to use for data loading. |
0
|
collate_fn
|
Optional
|
The function to use for collating data. |
None
|
pin_memory
|
bool
|
Whether to pin memory. |
True
|
persistent_workers
|
bool
|
Whether to use persistent workers. |
False
|
cv_num_folds
|
int
|
The number of cross-validation folds, disables cv when <= 1. |
1
|
cv_test_fold_id
|
int
|
The fold id to use for cross-validation evaluation. |
0
|
cv_enable_val_fold
|
bool
|
Whether to enable a validation fold. |
True
|
cv_replace_val_fold_as_test_fold
|
bool
|
Replace validation fold with test fold. Only used when cv_enable_val_fold is False. |
False
|
cv_fold_id_col
|
Optional
|
The column name containing the fold id from a pre-split dataset. Setting to None to enable automatic splitting. |
None
|
cv_val_offset
|
int
|
The offset applied to cv_test_fold_id to determine val_fold_id. |
1
|
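When test_split_name or valid_split_name is None, the table above says the missing split is created from the training split using test_split_size and valid_split_size. A minimal sketch of that behaviour follows, assuming both fractions are taken relative to the original training set and that the shuffle is seeded by random_seed. The function name `carve_splits` is hypothetical and the library's actual splitting routine may differ.

```python
import random


def carve_splits(rows, test_split_size=0.2, valid_split_size=0.1, random_seed=42):
    """Split rows into (train, valid, test) lists (hypothetical helper)."""
    rng = random.Random(random_seed)  # seeded, so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)

    # Assumption: both sizes are fractions of the original training split.
    n_test = round(len(shuffled) * test_split_size)
    n_valid = round(len(shuffled) * valid_split_size)

    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


train, valid, test = carve_splits(range(100))
print(len(train), len(valid), len(test))  # 70 10 20
```

Keeping random_seed fixed makes the split identical across runs, which is what allows two experiments to be compared on the same held-out samples.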
modelgenerator.data.SequenceRegressionDataModule
Bases: RegressionDataModule, HFDatasetLoaderMixin
Data module for sequence regression datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| x_col | Union | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| normalize | bool | Whether to normalize the labels. | True |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
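The normalize option above is described only as normalizing the labels. One common interpretation is z-score normalization with statistics taken from the training labels, sketched below. This is an assumption about the behaviour rather than a description of the library's code, and `normalize_labels` is a hypothetical helper.

```python
from statistics import mean, pstdev


def normalize_labels(train_labels, labels):
    """Z-score labels using training-set statistics (assumed behaviour)."""
    mu = mean(train_labels)
    sigma = pstdev(train_labels) or 1.0  # guard against constant labels
    return [(y - mu) / sigma for y in labels]


train_labels = [1.0, 2.0, 3.0]
scaled = normalize_labels(train_labels, train_labels)
print(scaled)
```

Reusing the training statistics for the validation and test labels keeps all splits on the same scale without leaking held-out information into the normalization.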
modelgenerator.data.TokenClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face token classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class sequence under key 'labels'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | str | The name of the column containing the sequences. | 'sequences' |
| y_col | str | The name of the column containing the labels. | 'labels' |
| extra_cols | Optional | Additional columns to include in the dataset. | None |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| max_length | Optional | The maximum length of the sequences. | None |
| truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
| pairwise | bool | Whether the labels are pairwise. | False |
| collate_fn | Optional | The function to use for collating data. | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
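The interaction of max_length, extra_cols, and truncate_extra_cols above can be illustrated with a small sketch. It assumes one character per token and per-token label lists of the same length as the sequence. The helper `truncate_sample` and the `conservation` column are hypothetical and are not part of the library.

```python
def truncate_sample(sample, max_length=None, extra_cols=(), truncate_extra_cols=False):
    """Clip the per-token fields of one sample to max_length (hypothetical helper)."""
    out = dict(sample)
    if max_length is None:
        return out  # no length limit configured

    keys = ["sequences", "labels"]
    if truncate_extra_cols:
        keys += list(extra_cols)  # extra columns are clipped only on request
    for key in keys:
        out[key] = sample[key][:max_length]
    return out


sample = {
    "sequences": "ACGTACGT",
    "labels": [0, 0, 1, 1, 0, 0, 1, 1],
    "conservation": [0.1] * 8,  # hypothetical per-token extra column
}
clipped = truncate_sample(
    sample, max_length=4, extra_cols=["conservation"], truncate_extra_cols=True
)
print(clipped["sequences"], clipped["labels"])  # ACGT [0, 0, 1, 1]
```

Leaving truncate_extra_cols at its default of False would keep the extra column at its full length, which is appropriate for per-sample rather than per-token metadata.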
modelgenerator.data.DiffusionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for datasets with discrete diffusion-based noising and loss weights from MDLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | str | The column with the data to train on. | 'sequences' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| timesteps_per_sample | int | The number of timesteps per sample. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
Notes
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
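To make the sample layout concrete, the sketch below builds one sample with timesteps_per_sample noised copies of a sequence. It is an illustration under stated assumptions, not the library's implementation: the mask token string, the uniform sampling of the noise level t, and the 1/t weight (the form that arises for a linear masking schedule in MDLM-style objectives) are all assumed, and `noise_sample` is a hypothetical helper.

```python
import random

MASK = "[MASK]"  # assumed mask token, not taken from the library


def noise_sample(tokens, timesteps_per_sample=10, seed=0):
    """Build one sample holding several noised views of tokens (hypothetical helper)."""
    rng = random.Random(seed)
    sample = {"sequences": [], "target_sequences": [], "posterior_weights": []}
    for _ in range(timesteps_per_sample):
        # Noise level t; the lower bound keeps the 1/t weight bounded.
        t = rng.uniform(0.05, 1.0)
        # Absorbing-state noising: each token is masked independently with probability t.
        noised = [MASK if rng.random() < t else tok for tok in tokens]
        sample["sequences"].append(noised)
        sample["target_sequences"].append(list(tokens))
        sample["posterior_weights"].append(1.0 / t)  # assumed loss weight
    return sample


tokens = list("ACGTACGTAC")
sample = noise_sample(tokens, timesteps_per_sample=10)
print(len(sample["sequences"]), len(sample["posterior_weights"]))  # 10 10
```

Lightly noised views receive larger weights in this sketch, which compensates for the fact that they contain fewer masked positions to learn from.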
modelgenerator.data.ClassDiffusionDataModule
Bases: SequenceClassificationDataModule
Data module for conditional (or class-filtered) diffusion, and applying discrete diffusion noising.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | str | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| timesteps_per_sample | int | The number of timesteps per sample. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
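The class_filter option above restricts the dataset to one or more classes before noising is applied. A minimal sketch of that filtering step follows, assuming integer class labels stored under y_col and rows represented as dictionaries. The helper `filter_by_class` is hypothetical.

```python
def filter_by_class(rows, class_filter=None, y_col="labels"):
    """Keep only rows whose label is in class_filter (hypothetical helper)."""
    if class_filter is None:
        return list(rows)  # no filtering requested
    # Accept either a single class or a collection of classes.
    wanted = {class_filter} if isinstance(class_filter, int) else set(class_filter)
    return [row for row in rows if row[y_col] in wanted]


rows = [
    {"sequences": "ACGT", "labels": 0},
    {"sequences": "TTGA", "labels": 1},
    {"sequences": "GGCA", "labels": 2},
]
print(filter_by_class(rows, class_filter=1))       # [{'sequences': 'TTGA', 'labels': 1}]
print(len(filter_by_class(rows, class_filter=[0, 2])))  # 2
```

Filtering to a single class in this way turns class-conditional training into ordinary unconditional training on that class alone.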
modelgenerator.data.ConditionalDiffusionDataModule
Bases: SequenceRegressionDataModule
Data module for conditional diffusion with a continuous condition, and applying discrete diffusion noising.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | str | The name of the column(s) containing the sequences. | 'sequences' |
| y_col | str | The name of the column(s) containing the labels. | 'labels' |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| normalize | bool | Whether to normalize the labels. | True |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| timesteps_per_sample | int | The number of timesteps per sample. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.MLMDataModule
Bases: SequenceClassificationDataModule
Data module for continuing pretraining on a masked language modeling task.
Note
Each sample includes a single sequence under key 'sequences' and a single target sequence under key 'target_sequences'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the dataset, can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
| config_name | Optional | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
| x_col | str | The name of the column containing the sequences. Defaults to "sequences". | 'sequences' |
| y_col | Union | The name of the column(s) containing the labels. | 'labels' |
| masking_rate | float | The masking rate. Defaults to 0.15. | 0.15 |
| rename_cols | dict[str, str] \| None | A dictionary mapping the original column names to the new column names. | None |
| class_filter | Union | Filter the dataset to only include samples with the specified class(es). | None |
| generate_uid | bool | Whether to generate a unique ID for each sample. | False |
| train_split_name | Optional | The name of the training split. | 'train' |
| test_split_name | Optional | The name of the test split. Also used for | 'test' |
| valid_split_name | Optional | The name of the validation split. | None |
| train_split_files | Union | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
| test_split_files | Union | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
| valid_split_files | Union | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
| test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
| valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
| random_seed | int | The random seed to use for splitting the data. | 42 |
| extra_reader_kwargs | Optional | Extra kwargs for dataset readers. | None |
| batch_size | int | The batch size. | 128 |
| shuffle | bool | Whether to shuffle the data. | True |
| sampler | Optional | The sampler to use. | None |
| num_workers | int | The number of workers to use for data loading. | 0 |
| collate_fn | Optional | The function to use for collating data. | None |
| pin_memory | bool | Whether to pin memory. | True |
| persistent_workers | bool | Whether to use persistent workers. | False |
| cv_num_folds | int | The number of cross-validation folds, disables cv when <= 1. | 1 |
| cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
| cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
| cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
| cv_fold_id_col | Optional | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
| cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
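The masking_rate parameter and the 'sequences' / 'target_sequences' keys described above can be illustrated with a short sketch. It assumes each token is independently replaced by a mask token with probability masking_rate; the mask token string and the helper `mask_sequence` are hypothetical, and the library may apply a different corruption scheme (for example, mixing in random or unchanged tokens).

```python
import random


def mask_sequence(tokens, masking_rate=0.15, mask_token="[MASK]", seed=0):
    """Return one MLM sample with masked inputs and clean targets (hypothetical helper)."""
    rng = random.Random(seed)
    # Each position is masked independently with probability masking_rate.
    masked = [mask_token if rng.random() < masking_rate else tok for tok in tokens]
    return {"sequences": masked, "target_sequences": list(tokens)}


tokens = list("ACGTACGTACGTACGTACGT")
mlm_sample = mask_sequence(tokens, masking_rate=0.15)
print(mlm_sample["sequences"])
print(mlm_sample["target_sequences"])
```

The model sees 'sequences' as input and is trained to recover 'target_sequences' at the masked positions, so the two lists always have the same length.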