Data
Data modules specify data sources, as well as data loading and preprocessing for use with Tasks.
They provide a simple interface for swapping data sources and re-using datasets for new workflows without any code changes, enabling rapid and reproducible experimentation.
They are specified with the `--data` argument in the CLI or in the `data` section of a configuration file.
Data modules can automatically load common data sources (json, tsv, txt, HuggingFace) and uncommon ones (h5ad, TileDB).
They transform, split, and sample these sources for training with `mgen fit`, evaluation with `mgen test/validate`, and inference with `mgen predict`.
This reference gives an overview of the available no-code data modules. If you would like to develop new datasets, see Experiment Design.

    data:
      class_path: modelgenerator.data.DMSFitnessPrediction
      init_args:
        path: genbio-ai/ProteinGYM-DMS
        train_split_files:
          - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
        train_split_name: train
        random_seed: 42
        batch_size: 32
        cv_num_folds: 5
        cv_test_fold_id: 0
        cv_enable_val_fold: true
        cv_fold_id_col: fold_id
    model:
      ...
    trainer:
      ...

Note: Data modules are designed for use with a specific task, indicated in the class name.
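
The `class_path` / `init_args` layout in the configuration above follows a convention common to config-driven CLIs: the dotted path names a class, and `init_args` supplies its constructor keywords. As a rough illustration only, the sketch below resolves such an entry with the standard library. The `instantiate` helper is hypothetical, and `datetime.timedelta` stands in for a data module so that the example runs without ModelGenerator installed; it is not a description of how `mgen` parses configs internally.

```python
import importlib
from datetime import timedelta


def instantiate(entry: dict):
    """Build an object from a {'class_path': ..., 'init_args': ...} mapping.

    Hypothetical helper for illustration; not part of ModelGenerator.
    """
    module_name, _, class_name = entry["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**entry.get("init_args", {}))


# Stand-in entry: a stdlib class plays the role of a data module here.
entry = {
    "class_path": "datetime.timedelta",
    "init_args": {"days": 1, "hours": 12},
}
obj = instantiate(entry)
print(type(obj).__name__, obj)
```

Swapping the data source then amounts to changing `class_path` and `init_args` in the config, with no change to the calling code.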
DNA
modelgenerator.data.NTClassification
Bases: SequenceClassificationDataModule
Nucleotide Transformer benchmarks from InstaDeep.
Note
- Manuscript: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
- Data Card: InstaDeepAI/nucleotide_transformer_downstream_tasks
- Configs:
    - promoter_all
    - promoter_tata
    - promoter_no_tata
    - enhancers
    - enhancers_types
    - splice_sites_all
    - splice_sites_acceptor
    - splice_sites_donor
    - H3
    - H4
    - H3K9ac
    - H3K14ac
    - H4ac
    - H3K4me1
    - H3K4me2
    - H3K4me3
    - H3K36me3
    - H3K79me3

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'InstaDeepAI/nucleotide_transformer_downstream_tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'enhancers'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.GUEClassification
Bases: SequenceClassificationDataModule
Genome Understanding Evaluation benchmarks for DNABERT-2 from the Liu Lab at Northwestern.
Note
- Manuscript: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
- Data Card: leannmlindsey/GUE
- Configs:
    - emp_H3
    - emp_H3K14ac
    - emp_H3K36me3
    - emp_H3K4me1
    - emp_H3K4me2
    - emp_H3K4me3
    - emp_H3K79me3
    - emp_H3K9ac
    - emp_H4
    - emp_H4ac
    - human_tf_0
    - human_tf_1
    - human_tf_2
    - human_tf_3
    - human_tf_4
    - mouse_0
    - mouse_1
    - mouse_2
    - mouse_3
    - mouse_4
    - prom_300_all
    - prom_300_notata
    - prom_300_tata
    - prom_core_all
    - prom_core_notata
    - prom_core_tata
    - splice_reconstructed
    - virus_covid
    - virus_species_40
    - fungi_species_20
    - EPI_K562
    - EPI_HeLa-S3
    - EPI_NHEK
    - EPI_IMR90
    - EPI_HUVEC

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'leannmlindsey/GUE'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'emp_H3'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.ClinvarRetrieve
Bases: ZeroshotClassificationRetrieveDataModule
ClinVar dataset for genomic variant effect prediction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `None` |
| `test_split_files` | `List[str]` | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | `['ClinVar_Processed.tsv']` |
| `reference_file` | `str` | The file path to the reference file for retrieving sequences | `'hg38.ml.fa'` |
| `method` | `str` | The method used to compute metrics | `'Distance'` |
| `window` | `int` | The number of tokens taken on either side of the mutation site. The processed sequence length is | `512` |
| `**kwargs` | | Additional keyword arguments passed to the parent class. | `{}` |
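
To make the `window` parameter more concrete, here is a minimal sketch of taking a fixed number of positions on either side of a mutation site. It assumes one character per token, a 0-based `site` index, and that the site itself is included; `extract_window` is a hypothetical helper, and the module's actual tokenization and edge handling may differ.

```python
def extract_window(reference: str, site: int, window: int) -> str:
    """Return the positions within `window` of `site`, clipped to the sequence.

    Hypothetical helper: `site` is 0-based and is itself included.
    """
    start = max(0, site - window)
    end = min(len(reference), site + window + 1)
    return reference[start:end]


reference = "ACGTACGTACGTACGT"  # toy stand-in for a reference genome
context = extract_window(reference, site=8, window=3)
print(context, len(context))

# Near the start of the sequence the window is clipped rather than padded.
clipped = extract_window(reference, site=1, window=3)
print(clipped, len(clipped))
```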
modelgenerator.data.PromoterExpressionRegression
Bases: SequenceRegressionDataModule
Gene expression prediction from promoter sequences from the Regev Lab at the Broad Institute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/100M-random-promoters'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequence'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
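
The reference does not say which normalization `normalize` applies. As one common possibility, the sketch below standardizes regression labels to zero mean and unit variance (z-scoring). Treat the choice of z-scoring, and the `zscore` helper itself, as assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore(labels):
    """Standardize labels to zero mean and unit (population) variance."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        # Constant labels carry no spread; map them all to 0.
        return [0.0 for _ in labels]
    return [(y - mu) / sigma for y in labels]


labels = [2.0, 4.0, 6.0, 8.0]  # toy expression values
normalized = zscore(labels)
print(normalized)
```

When labels are normalized for training, predictions are on the normalized scale, so the same statistics are needed to map them back to the original units.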
modelgenerator.data.PromoterExpressionGeneration
Bases: ConditionalDiffusionDataModule
Promoter generation from gene expression data from the Regev Lab at the Broad Institute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/100M-random-promoters'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequence'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.DependencyMappingDataModule
Bases: SequencesDataModule
Data module for doing dependency mapping via in silico mutagenesis on a dataset of sequences.
Note
Each sample includes a single sequence under key 'sequences' and optionally an 'ids' to track outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | required |
| `vocab_file` | `str` | The path to the file with the vocabulary to mutate. | required |
| `config_name` | `Optional[str]` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `None` |
| `test_split_name` | `Optional[str]` | The name of the test split. Also used for | `None` |
| `test_split_files` | `Optional[str]` | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | `None` |
| `x_col` | `str` | The name of the column containing the sequences. Defaults to "sequence". | `'sequence'` |
| `id_col` | `str` | The name of the column containing the ids. Defaults to "id". | `'id'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
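
In silico mutagenesis, as referenced in the description of this module, generally means substituting each position of a sequence with every alternative token and scoring the results. The sketch below enumerates single-site substitutions for one sequence. The `single_site_mutants` helper and the four-letter vocabulary are hypothetical; the actual contents of `vocab_file` and the batching used by the module are not specified here.

```python
def single_site_mutants(sequence: str, vocab: list[str]):
    """Yield (position, reference, alternative, mutated_sequence) tuples.

    Every position is replaced by every vocabulary token that differs
    from the token already present at that position.
    """
    for pos, ref in enumerate(sequence):
        for alt in vocab:
            if alt == ref:
                continue
            yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


vocab = ["A", "C", "G", "T"]  # assumed stand-in for the contents of vocab_file
mutants = list(single_site_mutants("ACG", vocab))
for pos, ref, alt, mutated in mutants[:3]:
    print(pos, ref, alt, mutated)
```

A sequence of length L with a vocabulary of size V yields L × (V − 1) mutants, which is why these datasets grow quickly with sequence length.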
RNA
modelgenerator.data.TranslationEfficiency
Bases: SequenceRegressionDataModule
Translation efficiency prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - translation_efficiency_Muscle
    - translation_efficiency_HEK
    - translation_efficiency_pc3

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'translation_efficiency_Muscle'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column containing the labels. | `'labels'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `cv_num_folds` | `int` | The number of cross-validation folds, disables cv when <= 1. | `10` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_fold_id_col` | `str` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `'fold_id'` |
| `valid_split_name` | `str` | The name of the validation split. | `None` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0` |
| `test_split_name` | `str` | The name of the test split. Also used for | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
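
The `valid_split_size` and `test_split_size` parameters describe carving held-out splits from the training split when no named split exists. A minimal sketch of that idea follows, assuming the sizes are fractions of the training rows and that the split is a seeded random shuffle. The `holdout_split` helper is hypothetical; the library's own rounding and sampling strategy may differ.

```python
import random


def holdout_split(n_rows, valid_split_size, test_split_size, random_seed=42):
    """Partition row indices into train / valid / test by fraction.

    Hypothetical helper: sizes are treated as fractions of n_rows.
    """
    indices = list(range(n_rows))
    random.Random(random_seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_rows * test_split_size)
    n_valid = round(n_rows * valid_split_size)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


train, valid, test = holdout_split(100, valid_split_size=0.1, test_split_size=0.2)
print(len(train), len(valid), len(test))
```

With both sizes left at their default of `0`, this sketch returns every row in `train`, which matches the intent of not creating extra splits unless asked.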
modelgenerator.data.ExpressionLevel
Bases: SequenceRegressionDataModule
Expression level prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - expression_Muscle
    - expression_HEK
    - expression_pc3

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'expression_Muscle'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column containing the labels. | `'labels'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `cv_num_folds` | `int` | The number of cross-validation folds, disables cv when <= 1. | `10` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_fold_id_col` | `str` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `'fold_id'` |
| `valid_split_name` | `str` | The name of the validation split. | `None` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0` |
| `test_split_name` | `str` | The name of the test split. Also used for | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.TranscriptAbundance
Bases: SequenceRegressionDataModule
Transcript abundance prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - transcript_abundance_athaliana
    - transcript_abundance_dmelanogaster
    - transcript_abundance_ecoli
    - transcript_abundance_hsapiens
    - transcript_abundance_hvolcanii
    - transcript_abundance_ppastoris
    - transcript_abundance_scerevisiae

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'transcript_abundance_athaliana'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column containing the labels. | `'labels'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `cv_num_folds` | `int` | The number of cross-validation folds, disables cv when <= 1. | `5` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_fold_id_col` | `str` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `'fold_id'` |
| `valid_split_name` | `str` | The name of the validation split. | `None` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0` |
| `test_split_name` | `str` | The name of the test split. Also used for | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.ProteinAbundance
Bases: SequenceRegressionDataModule
Protein abundance prediction benchmarks from the Wang Lab at Princeton.
Note
- Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - protein_abundance_athaliana
    - protein_abundance_dmelanogaster
    - protein_abundance_ecoli
    - protein_abundance_hsapiens
    - protein_abundance_scerevisiae

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'protein_abundance_athaliana'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column containing the labels. | `'labels'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `cv_num_folds` | `int` | The number of cross-validation folds, disables cv when <= 1. | `5` |
| `cv_test_fold_id` | `int` | The fold id to use for cross-validation evaluation. | `0` |
| `cv_enable_val_fold` | `bool` | Whether to enable a validation fold. | `True` |
| `cv_fold_id_col` | `str` | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | `'fold_id'` |
| `valid_split_name` | `str` | The name of the validation split. | `None` |
| `valid_split_size` | `float` | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | `0` |
| `test_split_name` | `str` | The name of the test split. Also used for | `None` |
| `test_split_size` | `float` | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | `0` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.NcrnaFamilyClassification
Bases: SequenceClassificationDataModule
Non-coding RNA family classification benchmarks from DPTechnology.
Note
- Manuscript: UNI-RNA: UNIVERSAL PRE-TRAINED MODELS REVOLUTIONIZE RNA RESEARCH
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - ncrna_family_bnoise0
    - ncrna_family_bnoise200

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'ncrna_family_bnoise0'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'labels'` |
| `train_split_name` | `str` | The name of the training split. | `'train'` |
| `valid_split_name` | `str` | The name of the validation split. | `'validation'` |
| `test_split_name` | `str` | The name of the test split. Also used for | `'test'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.SpliceSitePrediction
Bases: SequenceClassificationDataModule
Splice site prediction benchmarks from the Thompson Lab at University of Strasbourg.
Note
- Manuscript: Spliceator: multi-species splice site prediction using convolutional neural networks
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - splice_site_acceptor
    - splice_site_donor

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'splice_site_acceptor'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'labels'` |
| `train_split_name` | `str` | The name of the training split. | `'train'` |
| `valid_split_name` | `str` | The name of the validation split. | `'validation'` |
| `test_split_name` | `str` | The name of the test split. Also used for | `'test_danio'` |
| `batch_size` | `int` | The batch size. | `16` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.ModificationSitePrediction
Bases: SequenceClassificationDataModule
Modification site prediction benchmarks from the Meng Lab at the University of Liverpool.
Note
- Manuscript: Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications
- Data Card: genbio-ai/rna-downstream-tasks
- Configs:
    - modification_site

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'modification_site'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'sequences'` |
| `y_col` | `List[str]` | The name of the column(s) containing the labels. | `[f'labels_{i}' for i in range(12)]` |
| `train_split_name` | `str` | The name of the training split. | `'train'` |
| `valid_split_name` | `str` | The name of the validation split. | `'validation'` |
| `test_split_name` | `str` | The name of the test split. Also used for | `'test'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
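
The default `y_col` for this module is written as a list comprehension, which expands to twelve column names, one per RNA modification type. The short sketch below shows that expansion and how a multi-label target vector could be gathered from a row. The example row is made up for illustration and is not taken from the dataset.

```python
# The default value from the parameter table, evaluated.
y_col = [f"labels_{i}" for i in range(12)]
print(y_col[0], "...", y_col[-1])

# Hypothetical row: one sequence with a 0/1 flag per modification type.
row = {"sequences": "ACGU", **{name: 0 for name in y_col}}
row["labels_3"] = 1

# Collect the twelve flags into one multi-label target vector.
target = [row[name] for name in y_col]
print(target)
```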
modelgenerator.data.RNAMeanRibosomeLoadDataModule
Bases: SequenceRegressionDataModule
Data module for the mean ribosome load dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'genbio-ai/rna-downstream-tasks'` |
| `config_name` | `str` | The name of the HF dataset configuration. Affects how the dataset is loaded. | `'mean_ribosome_load'` |
| `train_split_name` | `str` | The name of the training split. | `'train'` |
| `valid_split_name` | `str` | The name of the validation split. | `'validation'` |
| `test_split_name` | `str` | The name of the test split. Also used for | `'test'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'utr'` |
| `y_col` | `str` | The name of the column containing the labels. | `'rl'` |
| `extra_cols` | `List[str]` | Additional columns to include in the dataset. | `None` |
| `extra_col_aliases` | `List[str]` | The name of the columns to use as the alias for the extra columns. | `None` |
| `normalize` | `bool` | Whether to normalize the labels. | `False` |
| `generate_uid` | `bool` | Whether to generate a unique ID for each sample. | `False` |
| `**kwargs` | | Additional keyword arguments passed to the parent class. | `{}` |
Protein
modelgenerator.data.ContactPredictionBinary
Bases: TokenClassificationDataModule
Protein contact prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/contact_prediction_binary'` |
| `pairwise` | `bool` | Whether the labels are pairwise. | `True` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `batch_size` | `int` | The batch size. | `1` |
| `max_context_length` | `int` | Maximum context length for the input sequences. | `12800` |
| `msa_random_seed` | `Optional[int]` | Random seed for MSA generation. | `None` |
| `is_rag_dataset` | `bool` | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | `False` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.SspQ3
Bases: TokenClassificationDataModule
Protein secondary structure prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/ssp_q3'` |
| `pairwise` | `bool` | Whether the labels are pairwise. | `False` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `batch_size` | `int` | The batch size. | `1` |
| `max_context_length` | `int` | Maximum context length for the input sequences. | `12800` |
| `msa_random_seed` | `Optional[int]` | Random seed for MSA generation. | `None` |
| `is_rag_dataset` | `bool` | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | `False` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.FoldPrediction
Bases: SequenceClassificationDataModule
Protein fold prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/fold_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `max_context_length` | `int` | Maximum context length for the input sequences. | `12800` |
| `msa_random_seed` | `Optional[int]` | Random seed for MSA generation. | `None` |
| `is_rag_dataset` | `bool` | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | `False` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.LocalizationPrediction
Bases: SequenceClassificationDataModule
Protein localization prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/localization_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.MetalIonBinding
Bases: SequenceClassificationDataModule
Metal ion binding prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/metal_ion_binding'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.SolubilityPrediction
Bases: SequenceClassificationDataModule
Protein solubility prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/solubility_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.AntibioticResistance
Bases: SequenceClassificationDataModule
Antibiotic resistance prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/antibiotic_resistance'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.CloningClf
Bases: SequenceClassificationDataModule
Cloning classification prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/cloning_clf'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.MaterialProduction
Bases: SequenceClassificationDataModule
Material production prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/material_production'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.TcrPmhcAffinity
Bases: SequenceClassificationDataModule
TCR-pMHC affinity prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/tcr_pmhc_affinity'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.PeptideHlaMhcAffinity
Bases: SequenceClassificationDataModule
Peptide-HLA-MHC affinity prediction benchmarks from BioMap.
Note
- Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
- Data Card: proteinglm/peptide_HLA_MHC_affinity

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/peptide_HLA_MHC_affinity'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.TemperatureStability
Bases: SequenceClassificationDataModule
Temperature stability prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/temperature_stability'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column(s) containing the labels. | `'label'` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.FluorescencePrediction
Bases: SequenceRegressionDataModule
Fluorescence prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/fluorescence_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `max_context_length` | `int` | Maximum context length for the input sequences. | `12800` |
| `msa_random_seed` | `Optional[int]` | Random seed for MSA generation. | `None` |
| `is_rag_dataset` | `bool` | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | `False` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.FitnessPrediction
Bases: SequenceRegressionDataModule
Fitness prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/fitness_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.StabilityPrediction
Bases: SequenceRegressionDataModule
Stability prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/stability_prediction'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.EnzymeCatalyticEfficiencyPrediction
Bases: SequenceRegressionDataModule
Enzyme catalytic efficiency prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/enzyme_catalytic_efficiency'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.OptimalTemperaturePrediction
Bases: SequenceRegressionDataModule
Optimal temperature prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/optimal_temperature'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.OptimalPhPrediction
Bases: SequenceRegressionDataModule
Optimal pH prediction benchmarks from BioMap.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier | `'proteinglm/optimal_ph'` |
| `x_col` | `str` | The name of the column containing the sequences. | `'seq'` |
| `y_col` | `str` | The name of the column containing the labels. | `'label'` |
| `normalize` | `bool` | Whether to normalize the labels. | `True` |
| `**kwargs` | | Additional keyword arguments for the parent class. | `{}` |
modelgenerator.data.DMSFitnessPrediction
Bases: SequenceRegressionDataModule
Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'genbio-ai/ProteinGYM-DMS' |
train_split_files | list[str] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | ['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv'] |
x_col | str | The name of the column containing the sequences. | 'sequences' |
y_col | str | The name of the column containing the labels. | 'labels' |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 5 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | str | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | 'fold_id' |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | -1 |
valid_split_name | str | The name of the validation split. | None |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0 |
test_split_name | str | The name of the test split. Also used for | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0 |
max_context_length | int | Maximum context length for the input sequences. | 12800 |
msa_random_seed | Optional[int] | Random seed for MSA generation. | None |
is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
**kwargs | | Additional keyword arguments for the parent class. | {} |
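As a minimal sketch of the cross-validation arguments above, the following configuration asks the data module to split the data automatically rather than read fold assignments from a `fold_id` column. It uses only parameters listed in the table plus `random_seed` from the parent class; the held-out fold and seed are illustrative choices, not recommended settings.

```yaml
data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
      - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    random_seed: 42            # illustrative seed
    cv_num_folds: 5            # values <= 1 disable cross-validation
    cv_test_fold_id: 2         # illustrative: evaluate on fold 2
    cv_enable_val_fold: true
    cv_fold_id_col: null       # None enables automatic splitting
```

Changing only `cv_test_fold_id` between runs would then cover each fold in turn.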
Structure
modelgenerator.data.ContactPredictionBinary
Bases: TokenClassificationDataModule
Protein contact prediction benchmarks from BioMap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/contact_prediction_binary' |
pairwise | bool | Whether the labels are pairwise. | True |
x_col | str | The name of the column containing the sequences. | 'seq' |
y_col | str | The name of the column containing the labels. | 'label' |
batch_size | int | The batch size. | 1 |
max_context_length | int | Maximum context length for the input sequences. | 12800 |
msa_random_seed | Optional[int] | Random seed for MSA generation. | None |
is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
**kwargs | | Additional keyword arguments for the parent class. | {} |
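A configuration for this module might look like the sketch below. It restates the documented defaults and overrides only `max_context_length`, whose value here is purely illustrative.

```yaml
data:
  class_path: modelgenerator.data.ContactPredictionBinary
  init_args:
    path: proteinglm/contact_prediction_binary
    pairwise: true              # documented default for contact labels
    batch_size: 1               # documented default
    max_context_length: 1024    # illustrative override; default is 12800
```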
modelgenerator.data.SspQ3
Bases: TokenClassificationDataModule
Protein secondary structure prediction benchmarks from BioMap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/ssp_q3' |
pairwise | bool | Whether the labels are pairwise. | False |
x_col | str | The name of the column containing the sequences. | 'seq' |
y_col | str | The name of the column containing the labels. | 'label' |
batch_size | int | The batch size. | 1 |
max_context_length | int | Maximum context length for the input sequences. | 12800 |
msa_random_seed | Optional[int] | Random seed for MSA generation. | None |
is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
**kwargs | | Additional keyword arguments for the parent class. | {} |
modelgenerator.data.FoldPrediction
Bases: SequenceClassificationDataModule
Protein fold prediction benchmarks from BioMap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/fold_prediction' |
x_col | str | The name of the column containing the sequences. | 'seq' |
y_col | str | The name of the column(s) containing the labels. | 'label' |
max_context_length | int | Maximum context length for the input sequences. | 12800 |
msa_random_seed | Optional[int] | Random seed for MSA generation. | None |
is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
**kwargs | | Additional keyword arguments for the parent class. | {} |
modelgenerator.data.FluorescencePrediction
Bases: SequenceRegressionDataModule
Fluorescence prediction benchmarks from BioMap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'proteinglm/fluorescence_prediction' |
x_col | str | The name of the column containing the sequences. | 'seq' |
y_col | str | The name of the column containing the labels. | 'label' |
normalize | bool | Whether to normalize the labels. | True |
max_context_length | int | Maximum context length for the input sequences. | 12800 |
msa_random_seed | Optional[int] | Random seed for MSA generation. | None |
is_rag_dataset | bool | Whether the dataset is a RAG dataset for AIDO.Protein-RAG. | False |
**kwargs | | Additional keyword arguments for the parent class. | {} |
modelgenerator.data.StructureTokenDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Test-only data module for structure token predictors.
This data module is specifically designed for datasets that use amino acid sequences as input and structure tokens as labels.
Note
This module only supports testing and ignores training and validation splits. It assumes test split files contain sequences and optionally their structural token labels. If structural token labels are not provided, dummy labels are created.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
test_split_files | Optional[List[str]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
batch_size | int | The batch size. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class, in which training and validation split settings are overridden so that only the test split is loaded. | {} |
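Because this module loads only a test split, a configuration needs little beyond the data location. The sketch below assumes a local folder and file name, both hypothetical placeholders rather than shipped datasets.

```yaml
data:
  class_path: modelgenerator.data.StructureTokenDataModule
  init_args:
    path: ./data/structure_tokens   # hypothetical local folder
    test_split_files:
      - test.tsv                    # hypothetical file with sequences (labels optional)
    batch_size: 1                   # documented default
```

A configuration like this is meant for evaluation runs such as `mgen test`, since training and validation splits are ignored.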
Cell
modelgenerator.data.CellClassificationDataModule
Bases: DataInterface
Data module for cell classification.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
backbone_class_path | Optional[str] | Class path of the backbone model. | None |
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
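The split arguments above can be combined as in the following sketch, which builds the train and test splits from local files and carves a validation split out of the training data. The folder and file names are hypothetical placeholders.

```yaml
data:
  class_path: modelgenerator.data.CellClassificationDataModule
  init_args:
    path: ./data/cells               # hypothetical local folder
    train_split_files: train.h5ad    # hypothetical; referenced by train_split_name 'train'
    test_split_files: test.h5ad      # hypothetical; referenced by test_split_name 'test'
    valid_split_name: null           # None: create validation data from the training split
    valid_split_size: 0.1            # fraction of the training split held out
    random_seed: 42                  # seed used for splitting
    batch_size: 32                   # illustrative override; default is 128
```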
modelgenerator.data.CellClassificationLargeDataModule
Bases: DataInterface
Data module for cell classification. This class handles large datasets and is built on TileDB.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the TileDB dataset folder. | required |
train_split_subfolder | str | Subfolder name for the training split. | required |
valid_split_subfolder | str | Subfolder name for the validation split. | required |
test_split_subfolder | str | Subfolder name for the test split. | required |
backbone_class_path | Optional[str] | Class path of the backbone model. | None |
layer_name | str | Name of the layer in the TileDB dataset. | 'data' |
obs_column_name | str | Name of the column in | 'cell_type' |
measurement_name | str | Name of the measurement in the TileDB dataset. | 'RNA' |
axis_query_value_filter | Optional[str] | Optional filter for the axis query. | None |
prefetch_factor | int | Number of batches to prefetch. | 16 |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
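For the TileDB-backed module, the four required arguments locate the dataset and its per-split subfolders. The sketch below uses hypothetical folder names and otherwise restates documented defaults.

```yaml
data:
  class_path: modelgenerator.data.CellClassificationLargeDataModule
  init_args:
    path: ./data/cells_tiledb        # hypothetical TileDB dataset folder
    train_split_subfolder: train     # hypothetical subfolder names
    valid_split_subfolder: valid
    test_split_subfolder: test
    layer_name: data                 # documented default
    obs_column_name: cell_type       # documented default
    measurement_name: RNA            # documented default
    prefetch_factor: 16              # documented default
```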
modelgenerator.data.ClockDataModule
Bases: DataInterface
Data module for transcriptomic clock tasks.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
split_column | str | The column of | required |
label_scaling | Optional[str] | The type of label scaling to apply. | 'z_scaling' |
backbone_class_path | Optional[str] | Class path of the backbone model. | None |
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
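A transcriptomic clock configuration needs at least the dataset location and a split column. In the sketch below the path and the column name are hypothetical, and `label_scaling` simply restates the documented default.

```yaml
data:
  class_path: modelgenerator.data.ClockDataModule
  init_args:
    path: ./data/clock          # hypothetical local folder
    split_column: split         # hypothetical column name identifying each sample's split
    label_scaling: z_scaling    # documented default
    batch_size: 64              # illustrative override; default is 128
```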
modelgenerator.data.PertClassificationDataModule
Bases: DataInterface
Data module for perturbation classification.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
pert_column | str | Column of | required |
cell_line_column | str | Column of | required |
cell_line | str | Name of cell line to consider. | required |
split_seed | int | Seed for train/val/test splits. | 1234 |
train_frac | float | Fraction of examples to assign to train set. | 0.7 |
val_frac | float | Fraction of examples to assign to val set. | 0.15 |
test_frac | float | Fraction of examples to assign to test set. | 0.15 |
backbone_class_path | Optional[str] | Class path of the backbone model. | None |
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
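The sketch below shows one way to fill in the required perturbation arguments and adjust the split fractions. The path, column names, and cell line are hypothetical placeholders, and the fractions are chosen only so that they sum to 1, like the defaults.

```yaml
data:
  class_path: modelgenerator.data.PertClassificationDataModule
  init_args:
    path: ./data/perturbations       # hypothetical local folder
    pert_column: perturbation        # hypothetical column name
    cell_line_column: cell_line      # hypothetical column name
    cell_line: K562                  # hypothetical choice of cell line
    split_seed: 1234                 # documented default
    train_frac: 0.8                  # illustrative; defaults are 0.7 / 0.15 / 0.15
    val_frac: 0.1
    test_frac: 0.1
```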
Tissue
modelgenerator.data.CellWithNeighborDataModule
Bases: DataInterface
Data module for cell classification with neighbors for AIDO.Tissue.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
filter_columns | Optional[List[str]] | The columns of | None |
rename_columns | Optional[List[str]] | Optional list of columns to rename. | None |
use_random_neighbor | bool | Whether to use random neighbors. | False |
copy_center_as_neighbor | bool | Whether to copy center as a neighbor. | False |
neighbor_num | int | Number of neighbors to consider. | 10 |
generate_uid | bool | Whether to generate a unique identifier. | False |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
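A neighbor-aware configuration might look like the following sketch. The path and file names are hypothetical, and the neighbor settings restate the documented defaults so that they are easy to adjust.

```yaml
data:
  class_path: modelgenerator.data.CellWithNeighborDataModule
  init_args:
    path: ./data/tissue                # hypothetical local folder
    train_split_files: train.h5ad      # hypothetical file names
    test_split_files: test.h5ad
    neighbor_num: 10                   # documented default
    use_random_neighbor: false         # documented default
    copy_center_as_neighbor: false     # documented default
    generate_uid: false                # documented default
```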
Multimodal
modelgenerator.data.IsoformExpression
Bases: SequenceRegressionDataModule
Isoform expression prediction benchmarks from the
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | 'genbio-ai/transcript_isoform_expression_prediction' |
config_name | str | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | Union[str, list] | The names of the columns containing the sequences. | ['dna_seq', 'rna_seq', 'protein_seq'] |
valid_split_name | | The name of the validation split. | 'valid' |
train_split_files | Optional[Union[str, list[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | 'train_*.tsv' |
test_split_files | Optional[Union[str, list[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | 'test.tsv' |
valid_split_files | Optional[Union[str, list[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | 'validation.tsv' |
normalize | bool | Whether to normalize the labels. | True |
**kwargs | | Additional keyword arguments for the parent class. | {} |
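Since `x_col` accepts either a string or a list, a configuration can in principle restrict the input modalities. The sketch below assumes the module accepts a subset of the default columns, which this reference does not state explicitly; all other values are the documented defaults.

```yaml
data:
  class_path: modelgenerator.data.IsoformExpression
  init_args:
    path: genbio-ai/transcript_isoform_expression_prediction
    x_col:                      # assumption: a subset of the default columns is allowed
      - rna_seq
      - protein_seq
    train_split_files: train_*.tsv
    valid_split_files: validation.tsv
    test_split_files: test.tsv
    normalize: true
```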
Base Classes
modelgenerator.data.DataInterface
Bases: LightningDataModule, KFoldMixin
Base class for all data modules in this project. Handles the boilerplate of setting up data loaders.
Note
Subclasses must implement the setup method.
All datasets should return a dictionary of data items.
To use HF loading, add the HFDatasetLoaderMixin.
For any task-specific behaviors, implement transformations using torch.utils.data.Dataset objects.
See MLM for an example.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset: either (1) a local path to a data folder or (2) a Hugging Face dataset identifier. | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
modelgenerator.data.ColumnRetrievalDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Simple data module for retrieving and renaming columns from a dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
in_cols | List[str] | The names of the columns to retrieve. | [] |
out_cols | Optional[List[str]] | The names to use as aliases for the retrieved columns. | None |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments passed to the parent class. | {} |
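As a rough sketch of how these parameters might be combined, the config below retrieves two columns and renames them. The dataset path and column names are hypothetical placeholders, and the entry-by-entry pairing of in_cols with out_cols is an assumption based on the parameter descriptions rather than confirmed behavior.

```yaml
data:
  class_path: modelgenerator.data.ColumnRetrievalDataModule
  init_args:
    path: path/to/data_folder        # hypothetical: local folder or Hugging Face identifier
    in_cols: [raw_sequence, raw_id]  # hypothetical column names in the source data
    out_cols: [sequences, ids]       # assumed to alias in_cols entry by entry
    test_split_name: null            # no named test split, so one is created from the training split
    test_split_size: 0.2             # same value as the documented default
    batch_size: 64
```

Setting test_split_name to null is what activates test_split_size, per the description of that parameter.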
modelgenerator.data.SequencesDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for loading a simple dataset of sequences.
Note
Each sample includes a single sequence under key 'sequences' and optionally an 'id' to track outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
test_split_name | Optional[str] | The name of the test split. Also used for | None |
test_split_files | Optional[str] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
id_col | str | The name of the column containing the ids. | 'id' |
**kwargs | | Additional keyword arguments for the parent class. | {} |
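A minimal sketch of a config for this module, assuming a single local file of sequences. The folder and file names are hypothetical. Because test_split_files only takes effect when referenced by the name "test", test_split_name is set explicitly.

```yaml
data:
  class_path: modelgenerator.data.SequencesDataModule
  init_args:
    path: path/to/data_folder        # hypothetical local folder
    test_split_files: sequences.tsv  # hypothetical file inside that folder
    test_split_name: test            # references the split created from test_split_files
    x_col: sequence                  # column holding the sequences
    id_col: id                       # column holding the ids used to track outputs
```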
modelgenerator.data.SequenceClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face sequence classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class label under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str \| List[str] | The name of the column(s) containing the labels. | 'label' |
extra_cols | List[str] \| None | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] \| None | The names to use as aliases for the extra columns. | None |
class_filter | int \| List[int] \| None | Filter the dataset to only include samples with the specified class(es). | None |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
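The cross-validation parameters can be illustrated with a short config. This is a sketch only: the dataset path and the class ids passed to class_filter are hypothetical, and the comments restate the parameter descriptions rather than confirmed implementation details.

```yaml
data:
  class_path: modelgenerator.data.SequenceClassificationDataModule
  init_args:
    path: path/to/data_folder  # hypothetical: local folder or Hugging Face identifier
    x_col: sequence
    y_col: label
    class_filter: [0, 1]       # hypothetical: keep only samples labeled 0 or 1
    cv_num_folds: 5            # values <= 1 disable cross-validation
    cv_test_fold_id: 0         # fold held out for evaluation
    cv_enable_val_fold: true   # also hold out a validation fold
    cv_val_offset: 1           # validation fold id is derived from cv_test_fold_id and this offset
    cv_fold_id_col: null       # no pre-assigned folds, so the data is split automatically
```

To cover every fold, one would presumably rerun the same config with cv_test_fold_id set to each value from 0 to 4.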
modelgenerator.data.SequenceRegressionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for sequence regression datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str | The name of the column containing the labels. | 'label' |
extra_cols | List[str] | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns. | None |
normalize | bool | Whether to normalize the labels. | True |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
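A minimal sketch of a regression config that relies on automatic validation splitting. The dataset path and the label column name are hypothetical. valid_split_name is left as null so that valid_split_size applies, as the parameter description states.

```yaml
data:
  class_path: modelgenerator.data.SequenceRegressionDataModule
  init_args:
    path: path/to/data_folder  # hypothetical: local folder or Hugging Face identifier
    x_col: sequence
    y_col: fitness             # hypothetical name of a continuous label column
    normalize: true            # normalize the labels (documented default)
    valid_split_name: null     # no named validation split in the source data
    valid_split_size: 0.1      # so a validation split of this size is taken from the training split
    random_seed: 42            # fixes the split for reproducibility
```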
modelgenerator.data.TokenClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face token classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class sequence under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str | The name of the column containing the labels. | 'label' |
extra_cols | List[str] \| None | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] \| None | The names to use as aliases for the extra columns. | None |
max_length | Optional[int] | The maximum length of the sequences. | None |
truncate_extra_cols | bool | Whether to truncate the extra columns to the maximum length. | False |
pairwise | bool | Whether the labels are pairwise. | False |
collate_fn | Optional[callable] | The function to use for collating data. | None |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
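A sketch of a token classification config that caps sequence length. The dataset path, the extra column name, and the max_length value are hypothetical choices for illustration.

```yaml
data:
  class_path: modelgenerator.data.TokenClassificationDataModule
  init_args:
    path: path/to/data_folder   # hypothetical: local folder or Hugging Face identifier
    x_col: sequence
    y_col: label                # one class per token, per the note above the table
    max_length: 512             # hypothetical cap on sequence length
    extra_cols: [mask]          # hypothetical per-token column carried along with each sample
    truncate_extra_cols: true   # truncate the extra columns to max_length as well
    pairwise: false             # labels are per token, not per token pair
```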
modelgenerator.data.DiffusionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for datasets with discrete diffusion-based noising and loss weights from MDLM.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The column with the data to train on. | 'sequence' |
extra_cols | List[str] \| None | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] \| None | The names to use as aliases for the extra columns. | None |
timesteps_per_sample | int | The number of timesteps per sample. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
batch_size | int | The batch size. | 10 |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
Notes
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
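A minimal sketch of a config for unconditional diffusion training. The dataset path is a hypothetical placeholder, and the other values simply restate the documented defaults to show where they go.

```yaml
data:
  class_path: modelgenerator.data.DiffusionDataModule
  init_args:
    path: path/to/data_folder   # hypothetical: local folder or Hugging Face identifier
    x_col: sequence             # column with the data to train on
    timesteps_per_sample: 10    # each sample yields this many noised sequences
    randomize_targets: false    # experimental option, left at its default
    batch_size: 10              # documented default for this module
```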
modelgenerator.data.ClassDiffusionDataModule
Bases: SequenceClassificationDataModule
Data module for conditional (or class-filtered) diffusion, and applying discrete diffusion noising.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str \| List[str] | The name of the column(s) containing the labels. | 'label' |
timesteps_per_sample | int | The number of timesteps per sample. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
batch_size | int | The batch size. | 10 |
extra_cols | List[str] \| None | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] \| None | The names to use as aliases for the extra columns. | None |
class_filter | int \| List[int] \| None | Filter the dataset to only include samples with the specified class(es). | None |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
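For class-filtered diffusion, a config might look like the sketch below. The dataset path and the class id are hypothetical; class_filter is used here to restrict training to a single class, as its description suggests.

```yaml
data:
  class_path: modelgenerator.data.ClassDiffusionDataModule
  init_args:
    path: path/to/data_folder   # hypothetical: local folder or Hugging Face identifier
    x_col: sequence
    y_col: label
    class_filter: 1             # hypothetical: train only on samples of class 1
    timesteps_per_sample: 10
    batch_size: 10
```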
modelgenerator.data.ConditionalDiffusionDataModule
Bases: SequenceRegressionDataModule
Data module for conditional diffusion with a continuous condition, and applying discrete diffusion noising.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str | The name of the column containing the labels. | 'label' |
extra_cols | List[str] | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns. | None |
normalize | bool | Whether to normalize the labels. | True |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
timesteps_per_sample | int | The number of timesteps per sample. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost). | False |
batch_size | int | The batch size. | 10 |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
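A sketch of a config that conditions diffusion on a continuous value. The dataset path and the condition column name are hypothetical. The condition is supplied through y_col, which this module inherits from SequenceRegressionDataModule.

```yaml
data:
  class_path: modelgenerator.data.ConditionalDiffusionDataModule
  init_args:
    path: path/to/data_folder   # hypothetical: local folder or Hugging Face identifier
    x_col: sequence
    y_col: expression_level     # hypothetical column holding the continuous condition
    normalize: true             # normalize the condition values (documented default)
    timesteps_per_sample: 10
    batch_size: 10
```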
modelgenerator.data.MLMDataModule
Bases: SequenceClassificationDataModule
Data module for continuing pretraining on a masked language modeling task.
Note
Each sample includes a single sequence under key 'sequences' and a single target sequence under key 'target_sequences'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the dataset, either (1) a local path to a data folder or (2) a Hugging Face dataset identifier | required |
config_name | Optional[str] | The name of the HF dataset configuration. Affects how the dataset is loaded. | None |
masking_rate | float | The masking rate. | 0.15 |
x_col | str | The name of the column containing the sequences. | 'sequence' |
y_col | str \| List[str] | The name of the column(s) containing the labels. | 'label' |
extra_cols | List[str] \| None | Additional columns to include in the dataset. | None |
extra_col_aliases | List[str] \| None | The names to use as aliases for the extra columns. | None |
class_filter | int \| List[int] \| None | Filter the dataset to only include samples with the specified class(es). | None |
generate_uid | bool | Whether to generate a unique ID for each sample. | False |
train_split_name | Optional[str] | The name of the training split. | 'train' |
test_split_name | Optional[str] | The name of the test split. Also used for | 'test' |
valid_split_name | Optional[str] | The name of the validation split. | None |
train_split_files | Optional[Union[str, List[str]]] | Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments. | None |
test_split_files | Optional[Union[str, List[str]]] | Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for | None |
valid_split_files | Optional[Union[str, List[str]]] | Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments. | None |
test_split_size | float | The size of the test split. If test_split_name is None, creates a test split of this size from the training split. | 0.2 |
valid_split_size | float | The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split. | 0.1 |
random_seed | int | The random seed to use for splitting the data. | 42 |
extra_reader_kwargs | Optional[dict] | Extra kwargs for dataset readers. | None |
batch_size | int | The batch size. | 128 |
shuffle | bool | Whether to shuffle the data. | True |
sampler | Optional[Sampler] | The sampler to use. | None |
num_workers | int | The number of workers to use for data loading. | 0 |
collate_fn | Optional[callable] | The function to use for collating data. | None |
pin_memory | bool | Whether to pin memory. | True |
persistent_workers | bool | Whether to use persistent workers. | False |
cv_num_folds | int | The number of cross-validation folds. Cross-validation is disabled when <= 1. | 1 |
cv_test_fold_id | int | The fold id to use for cross-validation evaluation. | 0 |
cv_enable_val_fold | bool | Whether to enable a validation fold. | True |
cv_replace_val_fold_as_test_fold | bool | Replace validation fold with test fold. Only used when cv_enable_val_fold is False. | False |
cv_fold_id_col | Optional[str] | The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting. | None |
cv_val_offset | int | The offset applied to cv_test_fold_id to determine val_fold_id. | 1 |
**kwargs | | Additional keyword arguments for the parent class. | {} |
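A minimal sketch of a config for continued masked language model pretraining. The dataset path is hypothetical, and the masking_rate shown is simply the documented default written out explicitly.

```yaml
data:
  class_path: modelgenerator.data.MLMDataModule
  init_args:
    path: path/to/data_folder   # hypothetical: local folder or Hugging Face identifier
    x_col: sequence             # column holding the sequences to mask
    masking_rate: 0.15          # documented default masking rate
    batch_size: 128
    valid_split_size: 0.1       # applies because valid_split_name defaults to None
```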