Data

Data modules specify data sources, as well as data loading and preprocessing for use with Tasks. They provide a simple interface for swapping data sources and re-using datasets in new workflows without any code changes, enabling rapid and reproducible experimentation. They are specified with the --data argument in the CLI or in the data section of a configuration file.

Data modules can automatically load common data sources (json, tsv, txt, HuggingFace) and uncommon ones (h5ad, TileDB). They transform, split, and sample these sources for training with mgen fit, evaluation with mgen test/validate, and inference with mgen predict.

This reference gives an overview of the available no-code data modules. If you would like to develop new datasets, see Experiment Design.

data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
    - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    train_split_name: train
    random_seed: 42
    batch_size: 32
    cv_num_folds: 5
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_fold_id_col: fold_id
model:
  ...
trainer:
  ...

Note: Data modules are designed for use with a specific task, indicated in the class name.
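
Because a data module is plain configuration, a cross-validation sweep can be expressed by changing a single field. The sketch below is not part of ModelGenerator: it mirrors the data section of the example configuration above as a Python dictionary and derives one copy per test fold. The helper name configs_per_fold is hypothetical, and writing each dictionary back out as a configuration file is left to the reader.

```python
import copy

# Mirror of the `data` section from the example configuration above.
base_data_config = {
    "class_path": "modelgenerator.data.DMSFitnessPrediction",
    "init_args": {
        "path": "genbio-ai/ProteinGYM-DMS",
        "train_split_files": ["indels/B1LPA6_ECOSM_Russ_2020_indels.tsv"],
        "train_split_name": "train",
        "random_seed": 42,
        "batch_size": 32,
        "cv_num_folds": 5,
        "cv_test_fold_id": 0,
        "cv_enable_val_fold": True,
        "cv_fold_id_col": "fold_id",
    },
}


def configs_per_fold(base):
    """Return one deep copy of `base` per fold, varying only cv_test_fold_id."""
    num_folds = base["init_args"]["cv_num_folds"]
    configs = []
    for fold_id in range(num_folds):
        cfg = copy.deepcopy(base)  # avoid mutating the shared template
        cfg["init_args"]["cv_test_fold_id"] = fold_id
        configs.append(cfg)
    return configs


fold_configs = configs_per_fold(base_data_config)
```

Each entry of fold_configs differs from the template only in cv_test_fold_id, so the model and trainer sections can stay identical across runs.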

DNA

modelgenerator.data.NTClassification

Bases: SequenceClassificationDataModule

Nucleotide Transformer benchmarks from InstaDeep.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'InstaDeepAI/nucleotide_transformer_downstream_tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'enhancers'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.GUEClassification

Bases: SequenceClassificationDataModule

Genome Understanding Evaluation benchmarks for DNABERT-2 from the Liu Lab at Northwestern.

Note
  • Manuscript: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
  • Data Card: leannmlindsey/GUE
  • Configs:
    • emp_H3
    • emp_H3K14ac
    • emp_H3K36me3
    • emp_H3K4me1
    • emp_H3K4me2
    • emp_H3K4me3
    • emp_H3K79me3
    • emp_H3K9ac
    • emp_H4
    • emp_H4ac
    • human_tf_0
    • human_tf_1
    • human_tf_2
    • human_tf_3
    • human_tf_4
    • mouse_0
    • mouse_1
    • mouse_2
    • mouse_3
    • mouse_4
    • prom_300_all
    • prom_300_notata
    • prom_300_tata
    • prom_core_all
    • prom_core_notata
    • prom_core_tata
    • splice_reconstructed
    • virus_covid
    • virus_species_40
    • fungi_species_20
    • EPI_K562
    • EPI_HeLa-S3
    • EPI_NHEK
    • EPI_IMR90
    • EPI_HUVEC

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'leannmlindsey/GUE'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'emp_H3'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.ClinvarRetrieve

Bases: ZeroshotClassificationRetrieveDataModule

ClinVar dataset for genomic variant effect prediction.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

None
test_split_files List[str]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

['ClinVar_Processed.tsv']
reference_file str

The file path to the reference file for retrieving sequences

'hg38.ml.fa'
method str

The method used to compute metrics.

'Distance'
window int

The number of tokens taken on each side of the mutation site. The processed sequence length is 2 * window + 1.

512
**kwargs

Additional keyword arguments passed to the parent class. train_split_name=None, valid_split_name=None, and valid_split_size=0 are always overridden.

{}
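
The relationship between window and the processed sequence length can be illustrated with a small sketch. This is not the ClinvarRetrieve implementation: extract_window is a hypothetical helper, tokens are assumed to be single nucleotides, and clipping at the sequence boundary is an assumption (the library may handle edges differently, for example by padding).

```python
def extract_window(tokens, site, window):
    """Return up to `window` tokens on each side of `site`, plus the site itself."""
    start = max(0, site - window)                 # clip at the left boundary
    end = min(len(tokens), site + window + 1)     # clip at the right boundary
    return tokens[start:end]


sequence = list("ACGTACGTACGTA")  # 13 single-nucleotide tokens
centered = extract_window(sequence, site=6, window=3)
near_edge = extract_window(sequence, site=1, window=3)
```

Away from the boundaries, the result has 2 * window + 1 tokens, matching the description of the window parameter. With the default window of 512, that is 1025 tokens.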

modelgenerator.data.PromoterExpressionRegression

Bases: SequenceRegressionDataModule

Gene expression prediction from promoter sequences from the Regev Lab at the Broad Institute.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/100M-random-promoters'
x_col str

The name of the column containing the sequences.

'sequence'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}
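
This reference does not state which normalization the normalize flag applies. A common choice for regression targets is standardization to zero mean and unit variance, and the sketch below shows that choice purely as an assumption; zscore is a hypothetical helper, not a ModelGenerator function.

```python
from statistics import mean, pstdev


def zscore(labels):
    """Standardize labels to zero mean and unit (population) standard deviation."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        # Constant labels carry no scale information; map them all to 0.
        return [0.0 for _ in labels]
    return [(y - mu) / sigma for y in labels]


normalized = zscore([2.0, 4.0, 6.0, 8.0])
```

If labels are normalized in this way, predictions are on the normalized scale and would need the inverse transform before being compared with raw expression values.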

modelgenerator.data.PromoterExpressionGeneration

Bases: ConditionalDiffusionDataModule

Promoter generation from gene expression data from the Regev Lab at the Broad Institute.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/100M-random-promoters'
x_col str

The name of the column containing the sequences.

'sequence'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.DependencyMappingDataModule

Bases: SequencesDataModule

Data module for doing dependency mapping via in silico mutagenesis on a dataset of sequences.

Note

Each sample includes a single sequence under key 'sequences' and optionally an 'ids' to track outputs.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
vocab_file str

The path to the file with the vocabulary to mutate.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

None
test_split_files Optional[str]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
x_col str

The name of the column containing the sequences. Defaults to "sequence".

'sequence'
id_col str

The name of the column containing the ids. Defaults to "id".

'id'
**kwargs

Additional keyword arguments for the parent class.

{}
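
In silico mutagenesis, in its simplest form, substitutes every position of a sequence with every other token in a vocabulary. The sketch below illustrates that idea only; it is not the DependencyMappingDataModule implementation. The helper single_site_mutants is hypothetical, and the vocabulary is assumed to be a list of single-character tokens (how vocab_file is parsed is not covered here).

```python
def single_site_mutants(sequence, vocab):
    """Yield (position, new_token, mutated_sequence) for every single substitution."""
    for pos, original in enumerate(sequence):
        for token in vocab:
            if token == original:
                continue  # skip the unmutated sequence
            yield pos, token, sequence[:pos] + token + sequence[pos + 1:]


mutants = list(single_site_mutants("ACG", vocab=["A", "C", "G", "T"]))
```

A sequence of length L with a vocabulary of V tokens yields L * (V - 1) mutants under this scheme, so the number of model evaluations grows linearly with sequence length.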

RNA

modelgenerator.data.TranslationEfficiency

Bases: SequenceRegressionDataModule

Translation efficiency prediction benchmarks from the Wang Lab at Princeton.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'translation_efficiency_Muscle'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

10
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
**kwargs

Additional keyword arguments for the parent class.

{}
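
When cv_fold_id_col is None, folds have to be assigned automatically. The reference does not describe how this is done, so the following is only a sketch of one reasonable scheme: near-equal fold sizes, shuffled with a fixed seed for reproducibility. The helper assign_folds and its seed argument are hypothetical.

```python
import random


def assign_folds(num_rows, num_folds, seed=0):
    """Assign each row a fold id in [0, num_folds) with near-equal fold sizes."""
    fold_ids = [i % num_folds for i in range(num_rows)]
    random.Random(seed).shuffle(fold_ids)  # seeded, so the assignment is repeatable
    return fold_ids


fold_ids = assign_folds(num_rows=25, num_folds=10, seed=42)
# Rows held out for evaluation when cv_test_fold_id is 0.
test_rows = [i for i, f in enumerate(fold_ids) if f == 0]
```

Using a pre-split fold_id column instead keeps the folds identical across tools and runs, which is the safer option for published benchmarks.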

modelgenerator.data.ExpressionLevel

Bases: SequenceRegressionDataModule

Expression level prediction benchmarks from the Wang Lab at Princeton.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'expression_Muscle'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

10
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
**kwargs

Additional keyword arguments for the parent class.

{}
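
The valid_split_size and test_split_size parameters carve a held-out split from the training split when no named split exists. The sketch below shows the general idea, assuming the size is a fraction of the training rows (suggested by the float type, but not stated in this reference). The helper holdout_split is hypothetical and is not the library's implementation.

```python
import random


def holdout_split(rows, holdout_size, seed=0):
    """Split `rows` into (train, holdout); holdout_size is a fraction in [0, 1)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(len(rows) * holdout_size)
    # Sort the chosen indices so each split keeps the original row order.
    holdout = [rows[i] for i in sorted(indices[:n_holdout])]
    train = [rows[i] for i in sorted(indices[n_holdout:])]
    return train, holdout


rows = [f"seq_{i}" for i in range(10)]
train, valid = holdout_split(rows, holdout_size=0.2, seed=42)
unchanged, empty = holdout_split(rows, holdout_size=0)
```

With the default size of 0, nothing is held out and the training split is left intact, which matches the documented defaults.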

modelgenerator.data.TranscriptAbundance

Bases: SequenceRegressionDataModule

Transcript abundance prediction benchmarks from the Wang Lab at Princeton.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'transcript_abundance_athaliana'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.ProteinAbundance

Bases: SequenceRegressionDataModule

Protein abundance prediction benchmarks from the Wang Lab at Princeton.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'protein_abundance_athaliana'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
normalize bool

Whether to normalize the labels.

True
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_fold_id_col str

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.NcrnaFamilyClassification

Bases: SequenceClassificationDataModule

Non-coding RNA family classification benchmarks from DPTechnology.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'ncrna_family_bnoise0'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column(s) containing the labels.

'labels'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.SpliceSitePrediction

Bases: SequenceClassificationDataModule

Splice site prediction benchmarks from the Thompson Lab at University of Strasbourg.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'splice_site_acceptor'
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column(s) containing the labels.

'labels'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test_danio'
batch_size int

The batch size.

16
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.ModificationSitePrediction

Bases: SequenceClassificationDataModule

Modification site prediction benchmarks from the Meng Lab at the University of Liverpool.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'modification_site'
x_col str

The name of the column containing the sequences.

'sequences'
y_col List[str]

The name of the column(s) containing the labels.

[f'labels_{i}' for i in range(12)]
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
**kwargs

Additional keyword arguments for the parent class.

{}
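
The default for y_col above is a Python list comprehension, which expands to twelve column names. The snippet below evaluates it and shows how one multi-label target vector could be read from a row; the row dictionary is made-up illustration data, not a record from the dataset.

```python
# The documented default for y_col, evaluated.
y_col = [f"labels_{i}" for i in range(12)]

# Hypothetical row: every label column is 0 except labels_3.
row = {name: 0 for name in y_col}
row["labels_3"] = 1

# Collect the label columns in order to form one multi-label target vector.
target = [row[name] for name in y_col]
```

In a configuration file, the same default would be written out as an explicit list of the twelve names, from labels_0 through labels_11.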

modelgenerator.data.RNAMeanRibosomeLoadDataModule

Bases: SequenceRegressionDataModule

Data module for the mean ribosome load dataset.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/rna-downstream-tasks'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

'mean_ribosome_load'
train_split_name str

The name of the training split.

'train'
valid_split_name str

The name of the validation split.

'validation'
test_split_name str

The name of the test split. Also used for mgen predict.

'test'
x_col str

The name of the column containing the sequences.

'utr'
y_col str

The name of the column containing the labels.

'rl'
extra_cols List[str]

Additional columns to include in the dataset.

None
extra_col_aliases List[str]

The name of the columns to use as the alias for the extra columns.

None
normalize bool

Whether to normalize the labels.

False
generate_uid bool

Whether to generate a unique ID for each sample.

False
**kwargs

Additional keyword arguments passed to the parent class.

{}
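
The extra_cols and extra_col_aliases parameters suggest that additional columns can be carried along under different names. The sketch below assumes the two lists are paired by position and that missing aliases fall back to the original column names; both are assumptions, not documented behavior. The helper select_with_aliases and the column names library and total_reads are hypothetical.

```python
def select_with_aliases(row, extra_cols, extra_col_aliases=None):
    """Pick `extra_cols` from `row`, renaming each to the alias at the same position."""
    aliases = extra_col_aliases or extra_cols  # no aliases: keep the original names
    if len(aliases) != len(extra_cols):
        raise ValueError("extra_col_aliases must have one entry per extra column")
    return {alias: row[col] for col, alias in zip(extra_cols, aliases)}


# Made-up row; only 'utr' and 'rl' are column names taken from the defaults above.
row = {"utr": "ACGT", "rl": 5.2, "library": "designed", "total_reads": 1200}
extras = select_with_aliases(row, ["library", "total_reads"], ["source", "reads"])
```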

Protein

modelgenerator.data.ContactPredictionBinary

Bases: TokenClassificationDataModule

Protein contact prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/contact_prediction_binary'
pairwise bool

Whether the labels are pairwise.

True
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}
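
The pairwise flag distinguishes labels defined over pairs of residues from labels defined per residue. As an illustration only, the sketch below builds a symmetric L x L contact matrix from a list of contacting residue pairs and contrasts it with a per-residue label list. How the dataset actually stores its labels is not described in this reference, so the layout here is an assumption, and contact_matrix is a hypothetical helper.

```python
def contact_matrix(length, contacts):
    """Build a symmetric length x length 0/1 matrix from (i, j) contact pairs."""
    matrix = [[0] * length for _ in range(length)]
    for i, j in contacts:
        matrix[i][j] = 1
        matrix[j][i] = 1  # contacts are undirected
    return matrix


# pairwise=True: one label per residue pair.
pairwise_labels = contact_matrix(4, [(0, 3), (1, 2)])
# pairwise=False: one label per residue, e.g. a secondary-structure class.
per_token_labels = [0, 1, 1, 2]
```

Pairwise targets grow quadratically with sequence length, which is one plausible reason for the small default batch_size of 1.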

modelgenerator.data.SspQ3

Bases: TokenClassificationDataModule

Protein secondary structure prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/ssp_q3'
pairwise bool

Whether the labels are pairwise.

False
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.FoldPrediction

Bases: SequenceClassificationDataModule

Protein fold prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/fold_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.LocalizationPrediction

Bases: SequenceClassificationDataModule

Protein localization prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/localization_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.MetalIonBinding

Bases: SequenceClassificationDataModule

Metal ion binding prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/metal_ion_binding'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.SolubilityPrediction

Bases: SequenceClassificationDataModule

Protein solubility prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/solubility_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.AntibioticResistance

Bases: SequenceClassificationDataModule

Antibiotic resistance prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/antibiotic_resistance'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.CloningClf

Bases: SequenceClassificationDataModule

Cloning classification prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/cloning_clf'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.MaterialProduction

Bases: SequenceClassificationDataModule

Material production prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/material_production'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.TcrPmhcAffinity

Bases: SequenceClassificationDataModule

TCR-pMHC affinity prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/tcr_pmhc_affinity'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.PeptideHlaMhcAffinity

Bases: SequenceClassificationDataModule

Peptide-HLA-MHC affinity prediction benchmarks from BioMap.

Note
  • Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
  • Data Card: proteinglm/peptide_HLA_MHC_affinity

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/peptide_HLA_MHC_affinity'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.TemperatureStability

Bases: SequenceClassificationDataModule

Temperature stability prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/temperature_stability'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.FluorescencePrediction

Bases: SequenceRegressionDataModule

Fluorescence prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/fluorescence_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.FitnessPrediction

Bases: SequenceRegressionDataModule

Fitness prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/fitness_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.StabilityPrediction

Bases: SequenceRegressionDataModule

Stability prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/stability_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.EnzymeCatalyticEfficiencyPrediction

Bases: SequenceRegressionDataModule

Enzyme catalytic efficiency prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/enzyme_catalytic_efficiency'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.OptimalTemperaturePrediction

Bases: SequenceRegressionDataModule

Optimal temperature prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/optimal_temperature'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.OptimalPhPrediction

Bases: SequenceRegressionDataModule

Optimal pH prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/optimal_ph'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.DMSFitnessPrediction

Bases: SequenceRegressionDataModule

Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/ProteinGYM-DMS'
train_split_files list[str]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv']
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col str

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

-1
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}
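
The cross-validation parameters above can be read together as a rule for turning a fold id column into train, validation, and test rows. The sketch below is one interpretation, not the library's code: it assumes the validation fold is cv_test_fold_id + cv_val_offset, wrapped modulo cv_num_folds, and it does not model cv_replace_val_fold_as_test_fold. The helper cv_partition is hypothetical.

```python
def cv_partition(fold_ids, num_folds, test_fold_id, enable_val_fold=True, val_offset=-1):
    """Group row indices into train/valid/test according to their fold id."""
    # Assumption: the validation fold id wraps around modulo num_folds.
    val_fold_id = (test_fold_id + val_offset) % num_folds if enable_val_fold else None
    split = {"train": [], "valid": [], "test": []}
    for row, fold in enumerate(fold_ids):
        if fold == test_fold_id:
            split["test"].append(row)
        elif fold == val_fold_id:
            split["valid"].append(row)
        else:
            split["train"].append(row)
    return split


fold_ids = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]  # made-up fold_id column for ten rows
split = cv_partition(fold_ids, num_folds=5, test_fold_id=0)
no_val = cv_partition(fold_ids, num_folds=5, test_fold_id=0, enable_val_fold=False)
```

Under this reading, the defaults (test fold 0, offset -1, five folds) would use fold 4 for validation and folds 1 to 3 for training.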

Structure

modelgenerator.data.ContactPredictionBinary

Bases: TokenClassificationDataModule

Protein contact prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/contact_prediction_binary'
pairwise bool

Whether the labels are pairwise.

True
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.SspQ3

Bases: TokenClassificationDataModule

Protein secondary structure prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/ssp_q3'
pairwise bool

Whether the labels are pairwise.

False
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
batch_size int

The batch size.

1
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.FoldPrediction

Bases: SequenceClassificationDataModule

Protein fold prediction benchmarks from BioMap.

Note

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/fold_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column(s) containing the labels.

'label'
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}
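
The sketch below shows one way to pass parent-class options through **kwargs. It assumes batch_size and valid_split_size are forwarded unchanged to SequenceClassificationDataModule, as the Bases line suggests; the values are illustrative.

```yaml
data:
  class_path: modelgenerator.data.FoldPrediction
  init_args:
    path: proteinglm/fold_prediction   # documented default
    x_col: seq                         # documented default
    y_col: label                       # documented default
    batch_size: 16                     # forwarded to the parent class (assumed)
    valid_split_size: 0.1              # forwarded to the parent class (assumed)
```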

modelgenerator.data.FluorescencePrediction

Bases: SequenceRegressionDataModule

Fluorescence prediction benchmarks from BioMap.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'proteinglm/fluorescence_prediction'
x_col str

The name of the column containing the sequences.

'seq'
y_col str

The name of the column containing the labels.

'label'
normalize bool

Whether to normalize the labels.

True
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}
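
A hedged example of disabling label normalization, for instance when a downstream metric should be reported on the raw fluorescence scale. Whether this suits a given experiment is an assumption left to the reader.

```yaml
data:
  class_path: modelgenerator.data.FluorescencePrediction
  init_args:
    path: proteinglm/fluorescence_prediction   # documented default
    normalize: false                           # documented default is true
```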

modelgenerator.data.DMSFitnessPrediction

Bases: SequenceRegressionDataModule

Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/ProteinGYM-DMS'
train_split_files list[str]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv']
x_col str

The name of the column containing the sequences.

'sequences'
y_col str

The name of the column containing the labels.

'labels'
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

5
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col str

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

'fold_id'
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

-1
valid_split_name str

The name of the validation split.

None
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0
test_split_name str

The name of the test split. Also used for mgen predict.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0
max_context_length int

Maximum context length for the input sequences.

12800
msa_random_seed Optional[int]

Random seed for MSA generation.

None
is_rag_dataset bool

Whether the dataset is a RAG dataset for AIDO.Protein-RAG.

False
**kwargs

Additional keyword arguments for the parent class.

{}
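
The cross-validation parameters above can be combined as in this sketch, which holds out fold 2 for testing and lets the module assign folds itself. It assumes the configuration parser maps YAML null to Python None; the fold index is illustrative.

```yaml
data:
  class_path: modelgenerator.data.DMSFitnessPrediction
  init_args:
    path: genbio-ai/ProteinGYM-DMS
    train_split_files:
    - indels/B1LPA6_ECOSM_Russ_2020_indels.tsv
    cv_num_folds: 5
    cv_test_fold_id: 2       # illustrative; the documented default is 0
    cv_enable_val_fold: true
    cv_fold_id_col: null     # None: ignore any pre-assigned folds and split automatically
```

Repeating the run with cv_test_fold_id set to each value from 0 to 4 would cover every fold once.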

modelgenerator.data.StructureTokenDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Test-only data module for structure token predictors.

This data module is designed for datasets that use amino acid sequences as input and structure tokens as labels.

Note

This module only supports testing and ignores training and validation splits. It assumes test split files contain sequences and optionally their structural token labels. If structural token labels are not provided, dummy labels are created.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_files Optional[List[str]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
batch_size int

The batch size.

1
**kwargs

Additional keyword arguments passed to the parent class, in which training and validation split settings are overridden so that only the test split is loaded.

{}
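
A minimal configuration sketch for this test-only module. The folder path and file name are hypothetical placeholders, not files shipped with the project.

```yaml
data:
  class_path: modelgenerator.data.StructureTokenDataModule
  init_args:
    path: path/to/structure_data   # hypothetical local data folder
    test_split_files:
    - test.tsv                     # hypothetical file holding the sequences
    batch_size: 1                  # documented default
```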

Cell

modelgenerator.data.CellClassificationDataModule

Bases: DataInterface

Data module for cell classification.

Note

Each sample includes a feature vector (one row of the data matrix) and a single class label (taken from one of the annotation columns).

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
backbone_class_path Optional[str]

Class path of the backbone model.

None
filter_columns Optional[list[str]]

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
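
The sketch below illustrates how filter_columns and rename_columns might be paired. The dataset path and both column names are hypothetical; in particular, renaming the label column to labels is an assumption about what a downstream task expects.

```yaml
data:
  class_path: modelgenerator.data.CellClassificationDataModule
  init_args:
    path: path/to/cell_dataset   # hypothetical local folder or dataset identifier
    filter_columns:
    - cell_type                  # hypothetical label column to keep
    rename_columns:
    - labels                     # hypothetical new name, same order as filter_columns
    test_split_size: 0.2         # documented default
    valid_split_size: 0.1        # documented default
```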

modelgenerator.data.CellClassificationLargeDataModule

Bases: DataInterface

Data module for cell classification. This class handles large datasets and is built on TileDB.

Note

Each sample includes a feature vector (one row of the data matrix) and a single class label (taken from one of the annotation columns).

Parameters:

Name Type Description Default
path str

Path to the TileDB dataset folder

required
train_split_subfolder str

Subfolder name for the training split.

required
valid_split_subfolder str

Subfolder name for the validation split.

required
test_split_subfolder str

Subfolder name for the test split.

required
backbone_class_path Optional[str]

Class path of the backbone model.

None
layer_name str

Name of the layer in the TileDB dataset.

'data'
obs_column_name str

Name of the obs column to use as the label.

'cell_type'
measurement_name str

Name of the measurement in the TileDB dataset.

'RNA'
axis_query_value_filter Optional[str]

Optional filter for the axis query.

None
prefetch_factor int

Number of batches to prefetch.

16
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
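
Because this module requires one subfolder per split, a configuration might look like the sketch below. The dataset path and the three subfolder names are hypothetical; the remaining values are the documented defaults.

```yaml
data:
  class_path: modelgenerator.data.CellClassificationLargeDataModule
  init_args:
    path: path/to/tiledb_dataset   # hypothetical TileDB dataset folder
    train_split_subfolder: train   # hypothetical subfolder name
    valid_split_subfolder: valid   # hypothetical subfolder name
    test_split_subfolder: test     # hypothetical subfolder name
    layer_name: data
    obs_column_name: cell_type
    measurement_name: RNA
    prefetch_factor: 16
```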

modelgenerator.data.ClockDataModule

Bases: DataInterface

Data module for transcriptomic clock tasks.

Note

Each sample includes a feature vector (one row of the data matrix) and a single scalar corresponding to donor age (taken from one of the annotation columns).

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
split_column str

The column that defines the split assignments.

required
label_scaling Optional[str]

The type of label scaling to apply.

'z_scaling'
backbone_class_path Optional[str]

Class path of the backbone model.

None
filter_columns Optional[list[str]]

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
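
A minimal sketch of a clock configuration. The dataset path and the name of the split column are hypothetical; label_scaling is left at its documented default because no other options are listed here.

```yaml
data:
  class_path: modelgenerator.data.ClockDataModule
  init_args:
    path: path/to/clock_dataset   # hypothetical local folder or dataset identifier
    split_column: split           # hypothetical column holding the split assignment
    label_scaling: z_scaling      # documented default
    batch_size: 64                # illustrative; the documented default is 128
```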

modelgenerator.data.PertClassificationDataModule

Bases: DataInterface

Data module for perturbation classification.

Note

Each sample includes a feature vector (one row of the data matrix) and a single class label (taken from one of the annotation columns).

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
pert_column str

Column containing perturbation labels.

required
cell_line_column str

Column containing cell line labels.

required
cell_line str

Name of cell line to consider.

required
split_seed int

Seed for train/val/test splits.

1234
train_frac float

Fraction of examples to assign to train set.

0.7
val_frac float

Fraction of examples to assign to val set.

0.15
test_frac float

Fraction of examples to assign to test set.

0.15
backbone_class_path Optional[str]

Class path of the backbone model.

None
filter_columns Optional[list[str]]

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
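
The required arguments and split fractions can be set together, as in this sketch. The path, column names, and cell line are hypothetical placeholders; the fractions are illustrative and were chosen to sum to 1, as the defaults do.

```yaml
data:
  class_path: modelgenerator.data.PertClassificationDataModule
  init_args:
    path: path/to/perturbation_dataset   # hypothetical
    pert_column: perturbation            # hypothetical column name
    cell_line_column: cell_line          # hypothetical column name
    cell_line: K562                      # hypothetical value in cell_line_column
    split_seed: 1234                     # documented default
    train_frac: 0.8                      # illustrative; the default is 0.7
    val_frac: 0.1                        # illustrative; the default is 0.15
    test_frac: 0.1                       # illustrative; the default is 0.15
```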

Tissue

modelgenerator.data.CellWithNeighborDataModule

Bases: DataInterface

Data module for cell classification with neighbors for AIDO.Tissue.

Note

Each sample includes a feature vector (one row of the data matrix) and a single class label (taken from one of the annotation columns). The feature vector is concatenated with the feature vectors of its neighbors.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
filter_columns Optional[List[str]]

The columns to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[List[str]]

Optional list of columns to rename.

None
use_random_neighbor bool

Whether to use random neighbors.

False
copy_center_as_neighbor bool

Whether to copy center as a neighbor.

False
neighbor_num int

Number of neighbors to consider.

10
generate_uid bool

Whether to generate a unique identifier.

False
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
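
A hedged sketch of the neighbor options. The dataset path is hypothetical and the neighbor count is illustrative; the two boolean flags are shown at their documented defaults.

```yaml
data:
  class_path: modelgenerator.data.CellWithNeighborDataModule
  init_args:
    path: path/to/tissue_dataset     # hypothetical
    neighbor_num: 5                  # illustrative; the documented default is 10
    use_random_neighbor: false       # documented default
    copy_center_as_neighbor: false   # documented default
    generate_uid: true               # illustrative; the documented default is false
```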

Multimodal

modelgenerator.data.IsoformExpression

Bases: SequenceRegressionDataModule

Isoform expression prediction benchmarks from the

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

'genbio-ai/transcript_isoform_expression_prediction'
config_name str

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col Union[str, list]

The name(s) of the column(s) containing the sequences.

['dna_seq', 'rna_seq', 'protein_seq']
valid_split_name

The name of the validation split.

'valid'
train_split_files Optional[Union[str, list[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

'train_*.tsv'
test_split_files Optional[Union[str, list[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

'test.tsv'
valid_split_files Optional[Union[str, list[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

'validation.tsv'
normalize bool

Whether to normalize the labels.

True
**kwargs

Additional keyword arguments for the parent class.

{}
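
Since x_col accepts a list, a configuration could restrict the inputs to a subset of modalities, as sketched below. That a two-item subset of the default list is accepted is an assumption, and the chosen task and model would need to match it.

```yaml
data:
  class_path: modelgenerator.data.IsoformExpression
  init_args:
    path: genbio-ai/transcript_isoform_expression_prediction   # documented default
    x_col:
    - rna_seq        # taken from the documented default list
    - protein_seq    # taken from the documented default list
    normalize: true  # documented default
```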

Base Classes

modelgenerator.data.DataInterface

Bases: LightningDataModule, KFoldMixin

Base class for all data modules in this project. Handles the boilerplate of setting up data loaders.

Note

Subclasses must implement the setup method. All datasets should return a dictionary of data items. To use HF loading, add the HFDatasetLoaderMixin. For any task-specific behaviors, implement transformations using torch.utils.data.Dataset objects. See MLM for an example.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1

modelgenerator.data.ColumnRetrievalDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Simple data module for retrieving and renaming columns from a dataset.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
in_cols List[str]

The name of the columns to retrieve.

[]
out_cols Optional[List[str]]

The name of the columns to use as the alias for the retrieved columns.

None
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds, disables cv when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments passed to the parent class.

{}
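
The in_cols and out_cols arguments pair up by position, which the following sketch illustrates. The dataset path and all four column names are hypothetical.

```yaml
data:
  class_path: modelgenerator.data.ColumnRetrievalDataModule
  init_args:
    path: path/to/dataset   # hypothetical local folder or dataset identifier
    in_cols:
    - seq                   # hypothetical source column
    - score                 # hypothetical source column
    out_cols:
    - sequences             # hypothetical alias for seq
    - labels                # hypothetical alias for score
```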

modelgenerator.data.SequencesDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for loading a simple dataset of sequences.

Note

Each sample includes a single sequence under key 'sequences' and optionally an 'id' to track outputs.

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

None
test_split_files Optional[str]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
x_col str

The name of the column containing the sequences.

'sequence'
id_col str

The name of the column containing the ids.

'id'
**kwargs

Additional keyword arguments for the parent class.

{}
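
A minimal sketch for running inference on a file of sequences. The path and file name are hypothetical. Because test_split_name defaults to None in this class, the sketch sets it to test so that the split created from test_split_files is actually referenced.

```yaml
data:
  class_path: modelgenerator.data.SequencesDataModule
  init_args:
    path: path/to/sequence_data     # hypothetical local data folder
    test_split_files: sequences.tsv # hypothetical file name
    test_split_name: test           # references the split built from test_split_files
    x_col: sequence                 # documented default
    id_col: id                      # documented default
```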

modelgenerator.data.SequenceClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face sequence classification datasets.

Note

Each sample includes a single sequence under key 'sequences' and a single class label under key 'labels'

Parameters:

Name Type Description Default
path str

Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequence'
y_col str | List[str]

The name of the column(s) containing the labels.

'label'
extra_cols List[str] | None

Additional columns to include in the dataset.

None
extra_col_aliases List[str] | None

The name of the columns to use as the alias for the extra columns.

None
class_filter int | List[int] | None

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}

modelgenerator.data.SequenceRegressionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for sequence regression datasets.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequence'
y_col str

The name of the column containing the labels.

'label'
extra_cols List[str]

Additional columns to include in the dataset.

None
extra_col_aliases List[str]

The names to use as aliases for the extra columns.

None
normalize bool

Whether to normalize the labels.

True
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}
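
As a minimal sketch, the configuration below loads a regression dataset from local files and carves a validation split out of the training data. The folder and file names are hypothetical placeholders, and the remaining values simply restate the defaults listed above.

```yaml
data:
  class_path: modelgenerator.data.SequenceRegressionDataModule
  init_args:
    path: path/to/data_folder      # hypothetical local data folder
    train_split_files:
    - train.tsv                    # hypothetical file, becomes the "train" split
    test_split_files:
    - test.tsv                     # hypothetical file, becomes the "test" split
    train_split_name: train
    test_split_name: test
    valid_split_name: null         # no named validation split
    valid_split_size: 0.1          # so a validation split is taken from "train"
    x_col: sequence
    y_col: label
    normalize: true
    random_seed: 42
    batch_size: 128
```

Because valid_split_name is null, the validation split is created from the training split using valid_split_size and random_seed, as described in the table.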

modelgenerator.data.TokenClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face token classification datasets.

Note

Each sample includes a single sequence under the key 'sequences' and a single class sequence under the key 'labels'.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequence'
y_col str

The name of the column containing the labels.

'label'
extra_cols List[str] | None

Additional columns to include in the dataset.

None
extra_col_aliases List[str] | None

The names to use as aliases for the extra columns.

None
max_length Optional[int]

The maximum length of the sequences.

None
truncate_extra_cols bool

Whether to truncate the extra columns to the maximum length.

False
pairwise bool

Whether the labels are pairwise.

False
collate_fn Optional[callable]

The function to use for collating data.

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}
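
The following is an illustrative configuration for a token classification dataset with a length cap. The dataset identifier, the extra column name, and the max_length value are assumptions chosen for the example and do not refer to a real dataset.

```yaml
data:
  class_path: modelgenerator.data.TokenClassificationDataModule
  init_args:
    path: your-org/your-token-dataset   # hypothetical Hugging Face identifier
    x_col: sequence
    y_col: label
    max_length: 512                     # illustrative cap on sequence length
    extra_cols:
    - position_mask                     # hypothetical per-token column
    truncate_extra_cols: true           # truncate the extra column to max_length as well
    pairwise: false
    batch_size: 32
```

Setting truncate_extra_cols to true is intended for extra columns that hold one value per token, so they are truncated to the same maximum length as the sequences.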

modelgenerator.data.DiffusionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for datasets with discrete diffusion-based noising and loss weights from MDLM.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The column with the data to train on.

'sequence'
extra_cols List[str] | None

Additional columns to include in the dataset.

None
extra_col_aliases List[str] | None

The names to use as aliases for the extra columns.

None
timesteps_per_sample int

The number of timesteps per sample.

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost).

False
batch_size int

The batch size.

10
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and the posterior weights are under 'posterior_weights'.
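
A minimal configuration sketch for diffusion training is shown below. The dataset identifier is a hypothetical placeholder; the diffusion-specific values restate the defaults from the table.

```yaml
data:
  class_path: modelgenerator.data.DiffusionDataModule
  init_args:
    path: your-org/your-sequence-dataset  # hypothetical Hugging Face identifier
    x_col: sequence
    timesteps_per_sample: 10              # noised copies generated per sample
    randomize_targets: false
    batch_size: 10
    test_split_name: null                 # no named test split
    test_split_size: 0.2                  # so a test split is taken from "train"
    random_seed: 42
```

Since each sample carries timesteps_per_sample noised sequences, a batch of 10 samples with 10 timesteps would presumably hold 100 noised sequences, which may explain why the default batch_size here is smaller than in the other data modules.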

modelgenerator.data.ClassDiffusionDataModule

Bases: SequenceClassificationDataModule

Data module for conditional (or class-filtered) diffusion, and applying discrete diffusion noising.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and the posterior weights are under 'posterior_weights'.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequence'
y_col str | List[str]

The name of the column(s) containing the labels.

'label'
timesteps_per_sample int

The number of timesteps per sample.

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost).

False
batch_size int

The batch size.

10
extra_cols List[str] | None

Additional columns to include in the dataset.

None
extra_col_aliases List[str] | None

The names to use as aliases for the extra columns.

None
class_filter int | List[int] | None

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}
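
As an illustration of class filtering, the sketch below restricts diffusion training to two classes. The dataset identifier and the class ids are hypothetical; substitute the integer labels used in your own y_col.

```yaml
data:
  class_path: modelgenerator.data.ClassDiffusionDataModule
  init_args:
    path: your-org/your-labeled-dataset  # hypothetical Hugging Face identifier
    x_col: sequence
    y_col: label
    class_filter:                        # keep only samples with these class ids
    - 0                                  # hypothetical class id
    - 2                                  # hypothetical class id
    timesteps_per_sample: 10
    randomize_targets: false
    batch_size: 10
```

class_filter also accepts a single integer, per the type listed in the table, when only one class should be kept.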

modelgenerator.data.ConditionalDiffusionDataModule

Bases: SequenceRegressionDataModule

Data module for conditional diffusion with a continuous condition, and applying discrete diffusion noising.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and the posterior weights are under 'posterior_weights'.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
x_col str

The name of the column containing the sequences.

'sequence'
y_col str

The name of the column containing the labels.

'label'
extra_cols List[str]

Additional columns to include in the dataset.

None
extra_col_aliases List[str]

The names to use as aliases for the extra columns.

None
normalize bool

Whether to normalize the labels.

True
generate_uid bool

Whether to generate a unique ID for each sample.

False
timesteps_per_sample int

The number of timesteps per sample.

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost).

False
batch_size int

The batch size.

10
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}
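
The cross-validation arguments can be combined with a continuous condition as sketched below. The dataset identifier and the condition column name are hypothetical, and the fold settings are illustrative rather than recommended values.

```yaml
data:
  class_path: modelgenerator.data.ConditionalDiffusionDataModule
  init_args:
    path: your-org/your-scored-dataset  # hypothetical Hugging Face identifier
    x_col: sequence
    y_col: score                        # hypothetical column holding the continuous condition
    normalize: true
    timesteps_per_sample: 10
    batch_size: 10
    cv_num_folds: 5                     # values > 1 enable cross-validation
    cv_test_fold_id: 0                  # fold held out for evaluation
    cv_enable_val_fold: true
    cv_val_offset: 1                    # validation fold id is derived from cv_test_fold_id with this offset
    cv_fold_id_col: null                # no pre-assigned folds, so folds are split automatically
    random_seed: 42
```

If the dataset already contains a fold assignment, set cv_fold_id_col to that column's name instead of null.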

modelgenerator.data.MLMDataModule

Bases: SequenceClassificationDataModule

Data module for continuing pretraining on a masked language modeling task.

Note

Each sample includes a single sequence under the key 'sequences' and a single target sequence under the key 'target_sequences'.

Parameters:

Name Type Description Default
path str

Path to the dataset. Can be (1) a local path to a data folder or (2) a Hugging Face dataset identifier.

required
config_name Optional[str]

The name of the HF dataset configuration. Affects how the dataset is loaded.

None
masking_rate float

The fraction of tokens to mask.

0.15
x_col str

The name of the column containing the sequences.

'sequence'
y_col str | List[str]

The name of the column(s) containing the labels.

'label'
extra_cols List[str] | None

Additional columns to include in the dataset.

None
extra_col_aliases List[str] | None

The names to use as aliases for the extra columns.

None
class_filter int | List[int] | None

Filter the dataset to only include samples with the specified class(es).

None
generate_uid bool

Whether to generate a unique ID for each sample.

False
train_split_name Optional[str]

The name of the training split.

'train'
test_split_name Optional[str]

The name of the test split. Also used for mgen predict.

'test'
valid_split_name Optional[str]

The name of the validation split.

None
train_split_files Optional[Union[str, List[str]]]

Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.

None
test_split_files Optional[Union[str, List[str]]]

Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for mgen predict.

None
valid_split_files Optional[Union[str, List[str]]]

Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.

None
test_split_size float

The size of the test split. If test_split_name is None, creates a test split of this size from the training split.

0.2
valid_split_size float

The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.

0.1
random_seed int

The random seed to use for splitting the data.

42
extra_reader_kwargs Optional[dict]

Extra kwargs for dataset readers.

None
batch_size int

The batch size.

128
shuffle bool

Whether to shuffle the data.

True
sampler Optional[Sampler]

The sampler to use.

None
num_workers int

The number of workers to use for data loading.

0
collate_fn Optional[callable]

The function to use for collating data.

None
pin_memory bool

Whether to pin memory.

True
persistent_workers bool

Whether to use persistent workers.

False
cv_num_folds int

The number of cross-validation folds. Cross-validation is disabled when <= 1.

1
cv_test_fold_id int

The fold id to use for cross-validation evaluation.

0
cv_enable_val_fold bool

Whether to enable a validation fold.

True
cv_replace_val_fold_as_test_fold bool

Replace validation fold with test fold. Only used when cv_enable_val_fold is False.

False
cv_fold_id_col Optional[str]

The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.

None
cv_val_offset int

The offset applied to cv_test_fold_id to determine val_fold_id.

1
**kwargs

Additional keyword arguments for the parent class.

{}
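
A minimal sketch of a continued-pretraining configuration follows. The folder and file names are hypothetical placeholders, and masking_rate restates the default from the table.

```yaml
data:
  class_path: modelgenerator.data.MLMDataModule
  init_args:
    path: path/to/pretraining_data   # hypothetical local data folder
    train_split_files:
    - sequences.txt                  # hypothetical file, becomes the "train" split
    x_col: sequence
    masking_rate: 0.15
    test_split_name: null            # no named test split
    test_split_size: 0.2             # so a test split is taken from "train"
    valid_split_name: null           # no named validation split
    valid_split_size: 0.1            # so a validation split is taken from "train"
    batch_size: 128
```

With both split names set to null, the test and validation splits are created from the training split using the sizes given, following the descriptions of test_split_size and valid_split_size above.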