Data

Nucleotide Transformer benchmarks from InstaDeep.

Note

Manuscript: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
Data Card: InstaDeepAI/nucleotide_transformer_downstream_tasks
Configs:
- promoter_all
- promoter_tata
- promoter_no_tata
- enhancers
- enhancers_types
- splice_sites_all
- splice_sites_acceptor
- splice_sites_donor
- H3
- H4
- H3K9ac
- H3K14ac
- H4ac
- H3K4me1
- H3K4me2
- H3K4me3
- H3K36me3
- H3K79me3

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'InstaDeepAI/nucleotide_transformer_downstream_tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'enhancers'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.GUEClassification`

Genome Understanding Evaluation benchmarks for DNABERT-2 from the Liu Lab at Northwestern.

Note

Manuscript: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
Data Card: leannmlindsey/GUE
Configs:
- emp_H3
- emp_H3K14ac
- emp_H3K36me3
- emp_H3K4me1
- emp_H3K4me2
- emp_H3K4me3
- emp_H3K79me3
- emp_H3K9ac
- emp_H4
- emp_H4ac
- human_tf_0
- human_tf_1
- human_tf_2
- human_tf_3
- human_tf_4
- mouse_0
- mouse_1
- mouse_2
- mouse_3
- mouse_4
- prom_300_all
- prom_300_notata
- prom_300_tata
- prom_core_all
- prom_core_notata
- prom_core_tata
- splice_reconstructed
- virus_covid
- virus_species_40
- fungi_species_20
- EPI_K562
- EPI_HeLa-S3
- EPI_NHEK
- EPI_IMR90
- EPI_HUVEC

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'leannmlindsey/GUE'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'emp_H3'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
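
As a rough illustration of how these parameters might be supplied, the sketch below builds a configuration dictionary in the `class_path` / `init_args` layout used by LightningCLI-style tools. That `mgen` accepts this layout is an assumption, not something stated in this section, and `build_data_config` is a hypothetical helper. The config names are copied from the list above.

```python
import json

# Config names copied from the GUE list above.
GUE_CONFIGS = {
    "emp_H3", "emp_H3K14ac", "emp_H3K36me3", "emp_H3K4me1", "emp_H3K4me2",
    "emp_H3K4me3", "emp_H3K79me3", "emp_H3K9ac", "emp_H4", "emp_H4ac",
    "human_tf_0", "human_tf_1", "human_tf_2", "human_tf_3", "human_tf_4",
    "mouse_0", "mouse_1", "mouse_2", "mouse_3", "mouse_4",
    "prom_300_all", "prom_300_notata", "prom_300_tata",
    "prom_core_all", "prom_core_notata", "prom_core_tata",
    "splice_reconstructed", "virus_covid", "virus_species_40",
    "fungi_species_20", "EPI_K562", "EPI_HeLa-S3", "EPI_NHEK",
    "EPI_IMR90", "EPI_HUVEC",
}


def build_data_config(config_name="emp_H3", path="leannmlindsey/GUE"):
    """Hypothetical helper: return a class_path/init_args dict for the data module."""
    if config_name not in GUE_CONFIGS:
        raise ValueError(f"unknown GUE config: {config_name!r}")
    return {
        "data": {
            "class_path": "modelgenerator.data.GUEClassification",
            "init_args": {"path": path, "config_name": config_name},
        }
    }


config = build_data_config("prom_300_all")
print(json.dumps(config, indent=2))
```

Checking the name before launching a run catches typos such as `prom_300_no_tata` early, rather than at dataset download time.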

`modelgenerator.data.ClinvarRetrieve`

Bases: ZeroshotClassificationRetrieveDataModule

ClinVar dataset for genomic variant effect prediction.

Note

Manuscript: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
Data Card: genbio-ai/Clinvar

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`None`
`test_split_files`	`List[str]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`['ClinVar_Processed.tsv']`
`reference_file`	`str`	The file path to the reference file for retrieving sequences	`'hg38.ml.fa'`
`method`	`str`	The method used to compute metrics.	`'Distance'`
`window`	`int`	The number of tokens taken on either side of the mutation site. The processed sequence length is `2 * window + 1`.	`512`
`**kwargs`		Additional keyword arguments passed to the parent class. `train_split_name=None`, `valid_split_name=None`, and `valid_split_size=0` are always overridden.	`{}`
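
The relationship between `window` and the processed sequence length can be sketched as follows. This is a minimal illustration, not the data module's own code: it assumes one token per nucleotide and a 0-based mutation site, and it pads with `N` at the reference boundaries, which is a choice made here for the example. `extract_window` is a hypothetical name.

```python
def extract_window(reference: str, site: int, window: int, pad: str = "N") -> str:
    """Return the 2 * window + 1 characters centred on a 0-based site.

    Assumes one token per nucleotide. Positions that fall outside the
    reference are filled with `pad` (an assumption made for this sketch).
    """
    if not 0 <= site < len(reference):
        raise IndexError("site lies outside the reference sequence")
    left = max(0, site - window)
    right = min(len(reference), site + window + 1)
    left_pad = pad * (window - (site - left))
    right_pad = pad * (window - (right - 1 - site))
    return left_pad + reference[left:right] + right_pad


centre = extract_window("ACGTACGTAC", site=4, window=2)
edge = extract_window("ACGTACGTAC", site=0, window=2)
print(centre, len(centre))
print(edge, len(edge))
```

With the default `window=512`, the same arithmetic gives sequences of 1025 tokens.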

`modelgenerator.data.PromoterExpressionRegression`

Gene expression prediction from promoter sequences from the Regev Lab at the Broad Institute.

Note

Manuscript: Deciphering eukaryotic gene-regulatory logic with 100 million random promoters
Data Card: genbio-ai/100M-random-promoters

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/100M-random-promoters'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
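
The table does not say how `normalize` rescales the labels. One common convention is z-scoring with statistics computed on the training split only; the sketch below assumes that convention purely for illustration, and `zscore_labels` is a hypothetical helper rather than part of the library.

```python
from statistics import mean, pstdev


def zscore_labels(train_labels, labels):
    """Standardise `labels` using the mean and std of `train_labels`.

    Assumption for this sketch: normalisation means z-scoring, with the
    statistics taken from the training split so no test information leaks.
    """
    mu = mean(train_labels)
    sigma = pstdev(train_labels) or 1.0  # guard against constant labels
    return [(y - mu) / sigma for y in labels]


train = [1.0, 2.0, 3.0]
normalized_train = zscore_labels(train, train)
normalized_test = zscore_labels(train, [2.0])
print(normalized_train, normalized_test)
```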

`modelgenerator.data.PromoterExpressionGeneration`

Bases: ConditionalDiffusionDataModule

Promoter generation from gene expression data from the Regev Lab at the Broad Institute.

Note

Manuscript: Deciphering eukaryotic gene-regulatory logic with 100 million random promoters
Data Card: genbio-ai/100M-random-promoters

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/100M-random-promoters'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.DependencyMappingDataModule`

Bases: SequencesDataModule

Data module for doing dependency mapping via in silico mutagenesis on a dataset of sequences.

Note

Each sample includes a single sequence under the key 'sequences' and, optionally, an 'ids' entry used to track outputs.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`vocab_file`	`str`	The path to the file with the vocabulary to mutate.	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_files`	`Optional[str]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`x_col`	`str`	The name of the column containing the sequences. Defaults to "sequence".	`'sequence'`
`id_col`	`str`	The name of the column containing the ids. Defaults to "id".	`'id'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
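
In silico mutagenesis, in its simplest form, enumerates every single-position substitution of a sequence using the tokens in a vocabulary. The sketch below shows that enumeration under stated assumptions: the vocabulary is passed as a list rather than read from `vocab_file`, and `single_site_mutants` is a hypothetical helper, not the data module's own implementation.

```python
def single_site_mutants(sequence: str, vocab: list[str]):
    """Yield (position, new_token, mutated_sequence) for every single substitution."""
    for pos, ref in enumerate(sequence):
        for alt in vocab:
            if alt != ref:  # skip the identity "mutation"
                yield pos, alt, sequence[:pos] + alt + sequence[pos + 1:]


mutants = list(single_site_mutants("ACG", ["A", "C", "G", "T"]))
for pos, alt, mutated in mutants[:3]:
    print(pos, alt, mutated)
```

A sequence of length L over a vocabulary of size V yields L × (V − 1) mutants when every position holds a vocabulary token, so long sequences grow the prediction workload quickly.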

RNA

`modelgenerator.data.TranslationEfficiency`

Translation efficiency prediction benchmarks from the Wang Lab at Princeton.

Note

Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- translation_efficiency_Muscle
- translation_efficiency_HEK
- translation_efficiency_pc3

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'translation_efficiency_Muscle'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column containing the labels.	`'labels'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`10`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_fold_id_col`	`str`	The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.	`'fold_id'`
`valid_split_name`	`str`	The name of the validation split.	`None`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
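
The cross-validation parameters above can be pictured with a small sketch. It assumes that "automatic splitting" means shuffling the samples and dealing them evenly into `cv_num_folds` folds, which is a guess at the behaviour rather than a description of it. `assign_folds` and `split_by_fold` are hypothetical helpers.

```python
import random


def assign_folds(n_samples: int, num_folds: int, seed: int = 0) -> list[int]:
    """Shuffle sample indices and deal them round-robin into num_folds folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    fold_ids = [0] * n_samples
    for rank, idx in enumerate(order):
        fold_ids[idx] = rank % num_folds
    return fold_ids


def split_by_fold(fold_ids, test_fold_id):
    """Hold out the fold named by test_fold_id; everything else is training data."""
    train = [i for i, f in enumerate(fold_ids) if f != test_fold_id]
    test = [i for i, f in enumerate(fold_ids) if f == test_fold_id]
    return train, test


fold_ids = assign_folds(n_samples=25, num_folds=10)
train_idx, test_idx = split_by_fold(fold_ids, test_fold_id=0)
print(len(train_idx), len(test_idx))
```

When the dataset already carries a fold column (the default `'fold_id'`), only the second step applies, and the stored ids take the place of `fold_ids`.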

`modelgenerator.data.ExpressionLevel`

Expression level prediction benchmarks from the Wang Lab at Princeton.

Note

Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- expression_Muscle
- expression_HEK
- expression_pc3

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'expression_Muscle'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column containing the labels.	`'labels'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`10`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_fold_id_col`	`str`	The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.	`'fold_id'`
`valid_split_name`	`str`	The name of the validation split.	`None`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
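
The `valid_split_size` and `test_split_size` rows describe carving a split out of the training data when no named split exists. A minimal sketch of that idea follows, assuming the size is a fraction between 0 and 1 and that samples are chosen by a seeded shuffle; both are assumptions, and `holdout_split` is a hypothetical helper.

```python
import random


def holdout_split(indices, split_size: float, seed: int = 0):
    """Move a fraction of the training indices into a held-out split.

    Returns (remaining_train, held_out). A split_size of 0 leaves the
    training set untouched, matching the default in the table above.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(round(len(shuffled) * split_size))
    return shuffled[n_held_out:], shuffled[:n_held_out]


train_idx, valid_idx = holdout_split(range(10), split_size=0.2)
print(len(train_idx), len(valid_idx))
```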

`modelgenerator.data.TranscriptAbundance`

Transcript abundance prediction benchmarks from the Wang Lab at Princeton.

Note

Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- transcript_abundance_athaliana
- transcript_abundance_dmelanogaster
- transcript_abundance_ecoli
- transcript_abundance_hsapiens
- transcript_abundance_hvolcanii
- transcript_abundance_ppastoris
- transcript_abundance_scerevisiae

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'transcript_abundance_athaliana'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column containing the labels.	`'labels'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`5`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_fold_id_col`	`str`	The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.	`'fold_id'`
`valid_split_name`	`str`	The name of the validation split.	`None`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.ProteinAbundance`

Protein abundance prediction benchmarks from the Wang Lab at Princeton.

Note

Manuscript: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- protein_abundance_athaliana
- protein_abundance_dmelanogaster
- protein_abundance_ecoli
- protein_abundance_hsapiens
- protein_abundance_scerevisiae

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'protein_abundance_athaliana'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column containing the labels.	`'labels'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`5`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_fold_id_col`	`str`	The column name containing the fold id from a pre-split dataset. Set to None to enable automatic splitting.	`'fold_id'`
`valid_split_name`	`str`	The name of the validation split.	`None`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.NcrnaFamilyClassification`

Non-coding RNA family classification benchmarks from DPTechnology.

Note

Manuscript: UNI-RNA: UNIVERSAL PRE-TRAINED MODELS REVOLUTIONIZE RNA RESEARCH
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- ncrna_family_bnoise0
- ncrna_family_bnoise200

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'ncrna_family_bnoise0'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'labels'`
`train_split_name`	`str`	The name of the training split.	`'train'`
`valid_split_name`	`str`	The name of the validation split.	`'validation'`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`'test'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.SpliceSitePrediction`

Splice site prediction benchmarks from the Thompson Lab at the University of Strasbourg.

Note

Manuscript: Spliceator: multi-species splice site prediction using convolutional neural networks
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- splice_site_acceptor
- splice_site_donor

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'splice_site_acceptor'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'labels'`
`train_split_name`	`str`	The name of the training split.	`'train'`
`valid_split_name`	`str`	The name of the validation split.	`'validation'`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`'test_danio'`
`batch_size`	`int`	The batch size.	`16`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.ModificationSitePrediction`

Modification site prediction benchmarks from the Meng Lab at the University of Liverpool.

Note

Manuscript: Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications
Data Card: genbio-ai/rna-downstream-tasks
Configs:
- modification_site

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'modification_site'`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`List[str]`	The name of the column(s) containing the labels.	`[f'labels_{i}' for i in range(12)]`
`train_split_name`	`str`	The name of the training split.	`'train'`
`valid_split_name`	`str`	The name of the validation split.	`'validation'`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`'test'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
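
The default `y_col` is a list comprehension that expands to twelve column names, one per RNA modification. The sketch below shows the expansion and how a row's label columns could be gathered into one multi-label vector. The example row is made up for illustration.

```python
# The default value from the table above, evaluated.
y_col = [f"labels_{i}" for i in range(12)]
print(y_col[0], y_col[-1])

# A made-up row: one sequence with a single positive modification label.
row = {"sequences": "ACGU", **{name: 0 for name in y_col}}
row["labels_3"] = 1

# Collect the twelve label columns into one multi-label target vector.
label_vector = [row[name] for name in y_col]
print(label_vector)
```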

`modelgenerator.data.RNAMeanRibosomeLoadDataModule`

Data module for the mean ribosome load dataset.

Note

Manuscript: Human 5′ UTR design and variant effect prediction from a massively parallel translation assay
Data Card: genbio-ai/rna-downstream-tasks

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/rna-downstream-tasks'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`'mean_ribosome_load'`
`train_split_name`	`str`	The name of the training split.	`'train'`
`valid_split_name`	`str`	The name of the validation split.	`'validation'`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`'test'`
`x_col`	`str`	The name of the column containing the sequences.	`'utr'`
`y_col`	`str`	The name of the column containing the labels.	`'rl'`
`extra_cols`	`List[str]`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str]`	The name of the columns to use as the alias for the extra columns.	`None`
`normalize`	`bool`	Whether to normalize the labels.	`False`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`

Protein

`modelgenerator.data.ContactPredictionBinary`

Protein contact prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/contact_prediction_binary

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/contact_prediction_binary'`
`pairwise`	`bool`	Whether the labels are pairwise.	`True`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`batch_size`	`int`	The batch size.	`1`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.SspQ3`

Protein secondary structure prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/ssp_q3

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/ssp_q3'`
`pairwise`	`bool`	Whether the labels are pairwise.	`False`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`batch_size`	`int`	The batch size.	`1`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.FoldPrediction`

Protein fold prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/fold_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/fold_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.LocalizationPrediction`

Protein localization prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/localization_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/localization_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.MetalIonBinding`

Metal ion binding prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/metal_ion_binding

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/metal_ion_binding'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.SolubilityPrediction`

Protein solubility prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/solubility_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/solubility_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.AntibioticResistance`

Antibiotic resistance prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/antibiotic_resistance

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/antibiotic_resistance'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.CloningClf`

Cloning classification prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/cloning_clf

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/cloning_clf'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.MaterialProduction`

Material production prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/material_production

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/material_production'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.TcrPmhcAffinity`

TCR-pMHC affinity prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/tcr_pmhc_affinity

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/tcr_pmhc_affinity'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.PeptideHlaMhcAffinity`

Peptide-HLA-MHC affinity prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/peptide_HLA_MHC_affinity

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/peptide_HLA_MHC_affinity'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.TemperatureStability`

Temperature stability prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/temperature_stability

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/temperature_stability'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.FluorescencePrediction`

Fluorescence prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/fluorescence_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/fluorescence_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.FitnessPrediction`

Fitness prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/fitness_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/fitness_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.StabilityPrediction`

Stability prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/stability_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/stability_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.EnzymeCatalyticEfficiencyPrediction`

Enzyme catalytic efficiency prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/enzyme_catalytic_efficiency

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/enzyme_catalytic_efficiency'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.OptimalTemperaturePrediction`

Optimal temperature prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/optimal_temperature

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/optimal_temperature'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.OptimalPhPrediction`

Optimal pH prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/optimal_ph

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/optimal_ph'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.DMSFitnessPrediction`

Deep mutational scanning (DMS) fitness prediction benchmarks from the Gal Lab at Oxford and the Marks Lab at Harvard.

Note

Manuscript: ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design
Data Card: genbio-ai/ProteinGYM-DMS

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/ProteinGYM-DMS'`
`train_split_files`	`list[str]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`['indels/B1LPA6_ECOSM_Russ_2020_indels.tsv']`
`x_col`	`str`	The name of the column containing the sequences.	`'sequences'`
`y_col`	`str`	The name of the column containing the labels.	`'labels'`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`5`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`str`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`'fold_id'`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`-1`
`valid_split_name`	`str`	The name of the validation split.	`None`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0`
`test_split_name`	`str`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
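
The cross-validation parameters above interact with one another, so a small sketch may help. It assumes, without confirmation from the source, that the validation fold id is `cv_test_fold_id + cv_val_offset` wrapped modulo `cv_num_folds`; the helper name `assign_cv_folds` is hypothetical and not part of `modelgenerator`.

```python
def assign_cv_folds(cv_num_folds, cv_test_fold_id, cv_val_offset, cv_enable_val_fold=True):
    """Return (train_folds, val_fold, test_fold) for one cross-validation round.

    Assumption: the validation fold is (test fold + offset) modulo the fold count.
    """
    if cv_num_folds <= 1:
        raise ValueError("cross-validation is disabled when cv_num_folds <= 1")
    test_fold = cv_test_fold_id
    val_fold = None
    if cv_enable_val_fold:
        val_fold = (cv_test_fold_id + cv_val_offset) % cv_num_folds
    # Every remaining fold is used for training.
    train_folds = [f for f in range(cv_num_folds) if f not in (test_fold, val_fold)]
    return train_folds, val_fold, test_fold


# Defaults listed in the table: 5 folds, test fold 0, validation offset -1.
train_folds, val_fold, test_fold = assign_cv_folds(5, 0, -1)
print(train_folds, val_fold, test_fold)  # [1, 2, 3] 4 0
```

Under this assumption, the default offset of `-1` selects the fold just before the test fold, wrapping around to the last fold when the test fold is `0`.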

Structure

`modelgenerator.data.ContactPredictionBinary`

Protein contact prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/contact_prediction_binary

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/contact_prediction_binary'`
`pairwise`	`bool`	Whether the labels are pairwise.	`True`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`batch_size`	`int`	The batch size.	`1`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
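
To illustrate what pairwise labels mean for contact prediction, here is a minimal sketch that builds a symmetric residue-by-residue matrix from a list of contacting index pairs. The input format, the helper name `contacts_to_pairwise_labels`, and the example sequence are assumptions for illustration; the actual label encoding on the data card may differ.

```python
def contacts_to_pairwise_labels(seq_len, contact_pairs):
    """Build a symmetric seq_len x seq_len 0/1 matrix from (i, j) residue index pairs."""
    labels = [[0] * seq_len for _ in range(seq_len)]
    for i, j in contact_pairs:
        labels[i][j] = 1
        labels[j][i] = 1  # a contact between i and j is also a contact between j and i
    return labels


sequence = "MKTAY"  # hypothetical 5-residue sequence
pairwise = contacts_to_pairwise_labels(len(sequence), [(0, 3), (1, 4)])
for row in pairwise:
    print(row)
```

A per-residue task would instead carry one label per position, that is, a list of length `len(sequence)` rather than a square matrix.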

`modelgenerator.data.SspQ3`

Protein secondary structure prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/ssp_q3

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/ssp_q3'`
`pairwise`	`bool`	Whether the labels are pairwise.	`False`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`batch_size`	`int`	The batch size.	`1`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.FoldPrediction`

Protein fold prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/fold_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/fold_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column(s) containing the labels.	`'label'`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.FluorescencePrediction`

Fluorescence prediction benchmarks from BioMap.

Note

Manuscript: xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Data Card: proteinglm/fluorescence_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'proteinglm/fluorescence_prediction'`
`x_col`	`str`	The name of the column containing the sequences.	`'seq'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`max_context_length`	`int`	Maximum context length for the input sequences.	`12800`
`msa_random_seed`	`Optional[int]`	Random seed for MSA generation.	`None`
`is_rag_dataset`	`bool`	Whether the dataset is a RAG dataset for AIDO.Protein-RAG.	`False`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.StructureTokenDataModule`

Test-only data module for structure token predictors.

This data module is designed for datasets that use amino acid sequences as input and structure tokens as labels.

Note

This module only supports testing and ignores training and validation splits. It assumes test split files contain sequences and optionally their structural token labels. If structural token labels are not provided, dummy labels are created.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`test_split_files`	`Optional[List[str]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`batch_size`	`int`	The batch size.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class, in which training and validation split settings are overridden so that only the test split is loaded.	`{}`

Cell

`modelgenerator.data.CellClassificationDataModule`

Bases: DataInterface

Data module for cell classification.

Note

Each sample includes a feature vector (one row of the feature matrix) and a single class label (one column of the sample annotations).

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`backbone_class_path`	`Optional[str]`	Class path of the backbone model.	`None`
`filter_columns`	`Optional[list[str]]`	The columns to use. Defaults to None, in which case all columns are used.	`None`
`rename_columns`	`Optional[list[str]]`	New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.	`None`
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`

`modelgenerator.data.CellClassificationLargeDataModule`

Bases: DataInterface

Data module for cell classification. This class handles large datasets and is built on TileDB.

Note

Each sample includes a feature vector (one row of the feature matrix) and a single class label (one column of the sample annotations).

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the TileDB dataset folder	required
`train_split_subfolder`	`str`	Subfolder name for the training split.	required
`valid_split_subfolder`	`str`	Subfolder name for the validation split.	required
`test_split_subfolder`	`str`	Subfolder name for the test split.	required
`backbone_class_path`	`Optional[str]`	Class path of the backbone model.	`None`
`layer_name`	`str`	Name of the layer in the TileDB dataset.	`'data'`
`obs_column_name`	`str`	Name of the column to use as the label.	`'cell_type'`
`measurement_name`	`str`	Name of the measurement in the TileDB dataset.	`'RNA'`
`axis_query_value_filter`	`Optional[str]`	Optional filter for the axis query.	`None`
`prefetch_factor`	`int`	Number of batches to prefetch.	`16`
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`

`modelgenerator.data.ClockDataModule`

Bases: DataInterface

Data module for transcriptomic clock tasks.

Note

Each sample includes a feature vector (one row of the feature matrix) and a single scalar corresponding to donor age (one column of the sample annotations).

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`split_column`	`str`	The column that defines the split assignments.	required
`label_scaling`	`Optional[str]`	The type of label scaling to apply.	`'z_scaling'`
`backbone_class_path`	`Optional[str]`	Class path of the backbone model.	`None`
`filter_columns`	`Optional[list[str]]`	The columns to use. Defaults to None, in which case all columns are used.	`None`
`rename_columns`	`Optional[list[str]]`	New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.	`None`
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`
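
The default `label_scaling` value is `'z_scaling'`. As a rough illustration, the sketch below standardizes labels to zero mean and unit variance and shows how predictions could be mapped back to the original scale. It assumes that `z_scaling` means the usual z-score; whether the library uses the population or sample standard deviation, and which split the statistics are fitted on, is not stated in the source. The helper names and the example ages are hypothetical.

```python
from statistics import mean, pstdev


def z_scale(labels):
    """Standardize labels to zero mean and unit (population) standard deviation."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        raise ValueError("labels are constant; z-scaling is undefined")
    return [(y - mu) / sigma for y in labels], mu, sigma


def inverse_z_scale(scaled, mu, sigma):
    """Map scaled values (for example, model predictions) back to the original units."""
    return [z * sigma + mu for z in scaled]


ages = [20.0, 40.0, 60.0]  # hypothetical donor ages
scaled, mu, sigma = z_scale(ages)
restored = inverse_z_scale(scaled, mu, sigma)
```

Keeping `mu` and `sigma` around matters in practice, because predictions made on scaled labels need the inverse transform before they can be read as ages.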

`modelgenerator.data.PertClassificationDataModule`

Bases: DataInterface

Data module for perturbation classification.

Note

Each sample includes a feature vector (one row of the feature matrix) and a single class label (one column of the sample annotations).

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`pert_column`	`str`	Column containing perturbation labels.	required
`cell_line_column`	`str`	Column containing cell line labels.	required
`cell_line`	`str`	Name of cell line to consider.	required
`split_seed`	`int`	Seed for train/val/test splits.	`1234`
`train_frac`	`float`	Fraction of examples to assign to train set.	`0.7`
`val_frac`	`float`	Fraction of examples to assign to val set.	`0.15`
`test_frac`	`float`	Fraction of examples to assign to test set.	`0.15`
`backbone_class_path`	`Optional[str]`	Class path of the backbone model.	`None`
`filter_columns`	`Optional[list[str]]`	The columns to use. Defaults to None, in which case all columns are used.	`None`
`rename_columns`	`Optional[list[str]]`	New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.	`None`
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`
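
The `cell_line`, `split_seed`, and `*_frac` parameters suggest a filter-then-split workflow. The sketch below shows one way this could work on plain dictionaries. The rounding rule, the shuffling method, the helper name `split_cell_line`, and the example records are all assumptions and are not taken from the library.

```python
import random


def split_cell_line(records, cell_line_column, cell_line,
                    train_frac=0.7, val_frac=0.15, test_frac=0.15, split_seed=1234):
    """Keep records from one cell line, then split them by the given fractions."""
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-8:
        raise ValueError("train_frac, val_frac and test_frac must sum to 1")
    selected = [r for r in records if r[cell_line_column] == cell_line]
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(split_seed).shuffle(selected)
    n_train = round(len(selected) * train_frac)
    n_val = round(len(selected) * val_frac)
    return (selected[:n_train],
            selected[n_train:n_train + n_val],
            selected[n_train + n_val:])  # the test set takes the remainder


# Hypothetical records: 40 cells, alternating between two cell lines.
records = [
    {"cell_line": "K562" if i % 2 == 0 else "RPE1", "perturbation": f"gene_{i % 5}"}
    for i in range(40)
]
train, val, test = split_cell_line(records, "cell_line", "K562")
print(len(train), len(val), len(test))  # 14 3 3
```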

Tissue

`modelgenerator.data.CellWithNeighborDataModule`

Bases: DataInterface

Data module for cell classification with neighbors for AIDO.Tissue.

Note

Each sample includes a feature vector (one row of the feature matrix) and a single class label (one column of the sample annotations). The feature vector is concatenated with the feature vectors of its neighbors.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`filter_columns`	`Optional[List[str]]`	The columns to use. Defaults to None, in which case all columns are used.	`None`
`rename_columns`	`Optional[List[str]]`	Optional list of columns to rename.	`None`
`use_random_neighbor`	`bool`	Whether to use random neighbors.	`False`
`copy_center_as_neighbor`	`bool`	Whether to copy center as a neighbor.	`False`
`neighbor_num`	`int`	Number of neighbors to consider.	`10`
`generate_uid`	`bool`	Whether to generate a unique identifier.	`False`
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`
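
The note above says each cell's feature vector is concatenated with those of its neighbors. The following sketch shows what that could look like on plain lists. How neighbors are actually found, and the exact meaning of `use_random_neighbor` and `copy_center_as_neighbor`, are inferred from the parameter names only; the helper `build_neighbor_sample` and its `seed` argument are hypothetical.

```python
import random


def build_neighbor_sample(features, center, neighbor_ids, neighbor_num=10,
                          use_random_neighbor=False, copy_center_as_neighbor=False,
                          seed=0):
    """Concatenate the center cell's features with those of neighbor_num neighbors."""
    if copy_center_as_neighbor:
        # Assumed meaning: the center cell stands in for every neighbor slot.
        chosen = [center] * neighbor_num
    elif use_random_neighbor:
        # Assumed meaning: ignore the given neighbors and draw other cells at random.
        candidates = [i for i in range(len(features)) if i != center]
        chosen = random.Random(seed).sample(candidates, neighbor_num)
    else:
        chosen = list(neighbor_ids[:neighbor_num])
    sample = list(features[center])
    for idx in chosen:
        sample.extend(features[idx])
    return sample


features = [[1, 2], [3, 4], [5, 6], [7, 8]]  # four cells, two features each
print(build_neighbor_sample(features, 0, [2, 1, 3], neighbor_num=2))
# [1, 2, 5, 6, 3, 4]
print(build_neighbor_sample(features, 0, [2, 1, 3], neighbor_num=2,
                            copy_center_as_neighbor=True))
# [1, 2, 1, 2, 1, 2]
```

In either case the resulting vector has `(neighbor_num + 1)` times as many entries as a single cell's feature vector.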

Multimodal

`modelgenerator.data.IsoformExpression`

Isoform expression prediction benchmarks.

Note

Manuscript: Multi-modal Transfer Learning between Biological Foundation Models
Data Card: genbio-ai/transcript_isoform_expression_prediction

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	`'genbio-ai/transcript_isoform_expression_prediction'`
`config_name`	`str`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`Union[str, list]`	The name(s) of the column(s) containing the sequences.	`['dna_seq', 'rna_seq', 'protein_seq']`
`valid_split_name`		The name of the validation split.	`'valid'`
`train_split_files`	`Optional[Union[str, list[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`'train_*.tsv'`
`test_split_files`	`Optional[Union[str, list[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`'test.tsv'`
`valid_split_files`	`Optional[Union[str, list[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`'validation.tsv'`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
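
The default `train_split_files` value, `'train_*.tsv'`, looks like a shell-style glob. Assuming that is how it is interpreted, the sketch below shows which files such a pattern would select from a folder listing. The helper `resolve_split_files` and the file names `train_0.tsv`, `train_1.tsv`, and `README.md` are hypothetical.

```python
from fnmatch import fnmatch


def resolve_split_files(available_files, pattern_or_list):
    """Return the files matching a single glob pattern or any pattern in a list."""
    if isinstance(pattern_or_list, str):
        patterns = [pattern_or_list]
    else:
        patterns = list(pattern_or_list)
    return sorted(f for f in available_files if any(fnmatch(f, p) for p in patterns))


files = ["train_0.tsv", "train_1.tsv", "validation.tsv", "test.tsv", "README.md"]
print(resolve_split_files(files, "train_*.tsv"))     # ['train_0.tsv', 'train_1.tsv']
print(resolve_split_files(files, "validation.tsv"))  # ['validation.tsv']
```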

Base Classes

`modelgenerator.data.DataInterface`

Bases: LightningDataModule, KFoldMixin

Base class for all data modules in this project. Handles the boilerplate of setting up data loaders.

Note

Subclasses must implement the setup method. All datasets should return a dictionary of data items. To use HF loading, add the HFDatasetLoaderMixin. For any task-specific behaviors, implement transformations using torch.utils.data.Dataset objects. See MLM for an example.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
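
The note above asks that every dataset return a dictionary of data items. A minimal sketch of such a map-style dataset follows. It is duck-typed so that it runs without PyTorch; in a real data module it would subclass `torch.utils.data.Dataset`. The class name `LabeledSequenceDataset` and the keys `"sequences"` and `"labels"` are assumptions chosen for illustration.

```python
class LabeledSequenceDataset:
    """Map-style dataset whose items are dictionaries, one per example."""

    def __init__(self, sequences, labels):
        if len(sequences) != len(labels):
            raise ValueError("sequences and labels must have the same length")
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # Returning a dict keeps each field addressable by name downstream.
        return {"sequences": self.sequences[idx], "labels": self.labels[idx]}


dataset = LabeledSequenceDataset(["ACGT", "GGCA"], [0, 1])
print(len(dataset))  # 2
print(dataset[1])    # {'sequences': 'GGCA', 'labels': 1}
```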

`modelgenerator.data.ColumnRetrievalDataModule`

Simple data module for retrieving and renaming columns from a dataset.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`in_cols`	`List[str]`	The name of the columns to retrieve.	`[]`
`out_cols`	`Optional[List[str]]`	The name of the columns to use as the alias for the retrieved columns.	`None`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments passed to the parent class.	`{}`
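
To illustrate how `in_cols` and `out_cols` relate, here is a minimal sketch of retrieving and renaming columns from a single row. The function `retrieve_columns` is hypothetical and is not part of `modelgenerator`. It assumes that `out_cols`, when given, pairs one-to-one with `in_cols`, and that the original names are kept when `out_cols` is None.

```python
from typing import Dict, List, Optional


def retrieve_columns(
    row: Dict[str, object],
    in_cols: List[str],
    out_cols: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Hypothetical helper: select in_cols from a row and rename them to out_cols."""
    # Assumption: without aliases, the retrieved columns keep their names.
    names = in_cols if out_cols is None else out_cols
    if len(names) != len(in_cols):
        raise ValueError("out_cols must have the same length as in_cols")
    return {alias: row[col] for col, alias in zip(in_cols, names)}


row = {"seq": "ACGT", "y": 1, "unused": 0.5}
print(retrieve_columns(row, ["seq", "y"], ["sequences", "labels"]))
```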

`modelgenerator.data.SequencesDataModule`

Data module for loading a simple dataset of sequences.

Note

Each sample includes a single sequence under key 'sequences' and optionally an 'id' to track outputs.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`None`
`test_split_files`	`Optional[str]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`id_col`	`str`	The name of the column containing the ids.	`'id'`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.SequenceClassificationDataModule`

Data module for Hugging Face sequence classification datasets.

Note

Each sample includes a single sequence under key 'sequences' and a single class label under key 'labels'.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str \| List[str]`	The name of the column(s) containing the labels.	`'label'`
`extra_cols`	`List[str] \| None`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str] \| None`	The names to use as aliases for the extra columns.	`None`
`class_filter`	`int \| List[int] \| None`	Filter the dataset to only include samples with the specified class(es).	`None`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
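
When `test_split_name` or `valid_split_name` is None, the corresponding split is carved out of the training split using `test_split_size`, `valid_split_size`, and `random_seed`. The sketch below shows that idea on a plain list. The function `split_train` is hypothetical. It assumes both sizes are fractions of the original training split and that the split is a seeded shuffle; the library may split differently.

```python
import random
from typing import List, Sequence, Tuple


def split_train(
    rows: Sequence[object],
    test_split_size: float = 0.2,
    valid_split_size: float = 0.1,
    random_seed: int = 42,
) -> Tuple[List[object], List[object], List[object]]:
    """Hypothetical helper: return (train, valid, test) carved from one split."""
    indices = list(range(len(rows)))
    # A seeded shuffle makes the split reproducible across runs.
    random.Random(random_seed).shuffle(indices)
    # Assumption: both sizes are fractions of the original training split.
    n_test = round(len(rows) * test_split_size)
    n_valid = round(len(rows) * valid_split_size)
    test = [rows[i] for i in indices[:n_test]]
    valid = [rows[i] for i in indices[n_test:n_test + n_valid]]
    train = [rows[i] for i in indices[n_test + n_valid:]]
    return train, valid, test


train, valid, test = split_train(list(range(100)))
print(len(train), len(valid), len(test))
```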

`modelgenerator.data.SequenceRegressionDataModule`

Data module for sequence regression datasets.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`extra_cols`	`List[str]`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str]`	The names to use as aliases for the extra columns.	`None`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
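
The `normalize` flag does not say which normalization is applied. As an illustration only, the sketch below assumes z-score normalization with statistics computed on the training labels; the actual scheme used by the module may differ. The helpers `zscore_labels` and `denormalize` are hypothetical.

```python
import statistics
from typing import List, Tuple


def zscore_labels(train_labels: List[float]) -> Tuple[List[float], float, float]:
    """Hypothetical helper: z-score labels and return (scaled, mean, std)."""
    mean = statistics.fmean(train_labels)
    # Assumption: population standard deviation of the training labels.
    std = statistics.pstdev(train_labels)
    if std == 0:
        raise ValueError("labels are constant; cannot normalize")
    return [(y - mean) / std for y in train_labels], mean, std


def denormalize(prediction: float, mean: float, std: float) -> float:
    """Map a prediction made in normalized space back to the label scale."""
    return prediction * std + mean


scaled, mean, std = zscore_labels([1.0, 2.0, 3.0])
print(scaled, mean, std)
```

If labels are normalized for training, predictions need the inverse transform before they are compared with raw labels.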

`modelgenerator.data.TokenClassificationDataModule`

Data module for Hugging Face token classification datasets.

Note

Each sample includes a single sequence under key 'sequences' and a single class sequence under key 'labels'.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`extra_cols`	`List[str] \| None`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str] \| None`	The names to use as aliases for the extra columns.	`None`
`max_length`	`Optional[int]`	The maximum length of the sequences.	`None`
`truncate_extra_cols`	`bool`	Whether to truncate the extra columns to the maximum length.	`False`
`pairwise`	`bool`	Whether the labels are pairwise.	`False`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`batch_size`	`int`	The batch size.	`128`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
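
For token-level tasks, `max_length` has to be applied consistently to the sequence and its labels. The sketch below illustrates that constraint. The helper `truncate_sample` is hypothetical, and it assumes that per-token labels are a flat list, that `pairwise` labels form a square matrix over token pairs, and that truncation keeps the leading tokens. None of this is confirmed by the library.

```python
from typing import List, Optional, Tuple, Union

Labels = Union[List[int], List[List[int]]]


def truncate_sample(
    sequence: str,
    labels: Labels,
    max_length: Optional[int] = None,
    pairwise: bool = False,
) -> Tuple[str, Labels]:
    """Hypothetical helper: truncate a sequence and its token labels together."""
    if max_length is None:
        return sequence, labels
    sequence = sequence[:max_length]
    if pairwise:
        # Assumption: pairwise labels are an L x L matrix; trim rows and columns.
        labels = [row[:max_length] for row in labels[:max_length]]
    else:
        labels = labels[:max_length]
    return sequence, labels


print(truncate_sample("ACGTAC", [0, 1, 0, 1, 1, 0], max_length=4))
```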

`modelgenerator.data.DiffusionDataModule`

Data module for datasets with discrete diffusion-based noising and loss weights from MDLM.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The column with the data to train on.	`'sequence'`
`extra_cols`	`List[str] \| None`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str] \| None`	The names to use as aliases for the extra columns.	`None`
`timesteps_per_sample`	`int`	The number of timesteps per sample.	`10`
`randomize_targets`	`bool`	Whether to randomize the target sequences for each timestep (experimental efficiency boost).	`False`
`batch_size`	`int`	The batch size.	`10`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
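
To make the sample layout concrete, the sketch below noises one tokenized sequence at `timesteps_per_sample` noise levels and returns the three keys named in the note. It is an illustration under stated assumptions, not the module's implementation: it assumes absorbing-state (masking) noise, a linear schedule in which each token is masked with probability `t`, and a loss weight of `1/t`. The names `MASK` and `noise_sample` are hypothetical.

```python
import random
from typing import Dict, List

MASK = "[MASK]"  # hypothetical mask token


def noise_sample(
    tokens: List[str],
    timesteps_per_sample: int = 10,
    seed: int = 0,
) -> Dict[str, list]:
    """Hypothetical helper: build noised inputs, targets, and weights for one sample."""
    rng = random.Random(seed)
    inputs, targets, weights = [], [], []
    for _ in range(timesteps_per_sample):
        # Assumption: noise level t is drawn uniformly, bounded away from zero.
        t = rng.uniform(1e-3, 1.0)
        # Assumption: each token is independently replaced by MASK with probability t.
        noised = [MASK if rng.random() < t else tok for tok in tokens]
        inputs.append(noised)
        targets.append(list(tokens))  # the target is always the clean sequence
        weights.append(1.0 / t)  # assumed weight for a linear masking schedule
    return {
        "sequences": inputs,
        "target_sequences": targets,
        "posterior_weights": weights,
    }


sample = noise_sample(list("ACGTACGT"), timesteps_per_sample=3)
print(sample["sequences"])
```

With this layout, one dataset row expands into `timesteps_per_sample` training examples, which may be why the default `batch_size` is smaller here than in the other data modules.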

`modelgenerator.data.ClassDiffusionDataModule`

Data module for conditional (or class-filtered) diffusion that applies discrete diffusion noising.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str \| List[str]`	The name of the column(s) containing the labels.	`'label'`
`timesteps_per_sample`	`int`	The number of timesteps per sample.	`10`
`randomize_targets`	`bool`	Whether to randomize the target sequences for each timestep (experimental efficiency boost).	`False`
`batch_size`	`int`	The batch size.	`10`
`extra_cols`	`List[str] \| None`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str] \| None`	The names to use as aliases for the extra columns.	`None`
`class_filter`	`int \| List[int] \| None`	Filter the dataset to only include samples with the specified class(es).	`None`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`
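
The `class_filter` parameter accepts either a single class id or a list of ids. A minimal sketch of that behavior on plain dictionaries follows. The helper `apply_class_filter` is hypothetical, and it assumes labels are integers stored under `y_col`.

```python
from typing import Dict, List, Optional, Union


def apply_class_filter(
    rows: List[Dict[str, object]],
    class_filter: Optional[Union[int, List[int]]] = None,
    y_col: str = "label",
) -> List[Dict[str, object]]:
    """Hypothetical helper: keep only rows whose label is in class_filter."""
    if class_filter is None:
        return list(rows)  # no filtering requested
    # Normalize a single class id to a set so both forms share one code path.
    allowed = {class_filter} if isinstance(class_filter, int) else set(class_filter)
    return [row for row in rows if row[y_col] in allowed]


rows = [
    {"sequence": "ACGT", "label": 0},
    {"sequence": "TTGA", "label": 1},
    {"sequence": "GGCC", "label": 2},
]
print(apply_class_filter(rows, class_filter=[0, 2]))
```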

`modelgenerator.data.ConditionalDiffusionDataModule`

Data module that applies discrete diffusion noising for conditional diffusion with a continuous condition.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the dataset, can be (1) a local path to a data folder or (2) a Huggingface dataset identifier	required
`config_name`	`Optional[str]`	The name of the HF dataset configuration. Affects how the dataset is loaded.	`None`
`x_col`	`str`	The name of the column containing the sequences.	`'sequence'`
`y_col`	`str`	The name of the column containing the labels.	`'label'`
`extra_cols`	`List[str]`	Additional columns to include in the dataset.	`None`
`extra_col_aliases`	`List[str]`	The names to use as aliases for the extra columns.	`None`
`normalize`	`bool`	Whether to normalize the labels.	`True`
`generate_uid`	`bool`	Whether to generate a unique ID for each sample.	`False`
`timesteps_per_sample`	`int`	The number of timesteps per sample.	`10`
`randomize_targets`	`bool`	Whether to randomize the target sequences for each timestep (experimental efficiency boost).	`False`
`batch_size`	`int`	The batch size.	`10`
`train_split_name`	`Optional[str]`	The name of the training split.	`'train'`
`test_split_name`	`Optional[str]`	The name of the test split. Also used for `mgen predict`.	`'test'`
`valid_split_name`	`Optional[str]`	The name of the validation split.	`None`
`train_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "train" from these files. Not used unless referenced by the name "train" in one of the split_name arguments.	`None`
`test_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "test" from these files. Not used unless referenced by the name "test" in one of the split_name arguments. Also used for `mgen predict`.	`None`
`valid_split_files`	`Optional[Union[str, List[str]]]`	Create a split called "valid" from these files. Not used unless referenced by the name "valid" in one of the split_name arguments.	`None`
`test_split_size`	`float`	The size of the test split. If test_split_name is None, creates a test split of this size from the training split.	`0.2`
`valid_split_size`	`float`	The size of the validation split. If valid_split_name is None, creates a validation split of this size from the training split.	`0.1`
`random_seed`	`int`	The random seed to use for splitting the data.	`42`
`extra_reader_kwargs`	`Optional[dict]`	Extra kwargs for dataset readers.	`None`
`shuffle`	`bool`	Whether to shuffle the data.	`True`
`sampler`	`Optional[Sampler]`	The sampler to use.	`None`
`num_workers`	`int`	The number of workers to use for data loading.	`0`
`collate_fn`	`Optional[callable]`	The function to use for collating data.	`None`
`pin_memory`	`bool`	Whether to pin memory.	`True`
`persistent_workers`	`bool`	Whether to use persistent workers.	`False`
`cv_num_folds`	`int`	The number of cross-validation folds, disables cv when <= 1.	`1`
`cv_test_fold_id`	`int`	The fold id to use for cross-validation evaluation.	`0`
`cv_enable_val_fold`	`bool`	Whether to enable a validation fold.	`True`
`cv_replace_val_fold_as_test_fold`	`bool`	Replace validation fold with test fold. Only used when cv_enable_val_fold is False.	`False`
`cv_fold_id_col`	`Optional[str]`	The name of the column containing the fold id in a pre-split dataset. Set to None to enable automatic splitting.	`None`
`cv_val_offset`	`int`	The offset applied to cv_test_fold_id to determine val_fold_id.	`1`
`**kwargs`		Additional keyword arguments for the parent class.	`{}`

`modelgenerator.data.MLMDataModule`