Data
modelgenerator.data.SequenceClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face sequence classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class label under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | str \| List[str] | The name of the column(s) containing the labels. Defaults to "label". | 'label' |
extra_cols | List[str] \| optional | Additional columns to include in the dataset. Defaults to None. | None |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns. Defaults to None. | None |
class_filter | List[int] \| int | The class to filter. Defaults to None. | None |
generate_uid | bool | Whether to generate a unique ID for each sample. Defaults to False. | False |
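
To make the parameter semantics concrete, the sketch below mimics the documented sample layout on plain Python dictionaries. It is not the library's implementation: `build_samples` and the `'uid'` key are hypothetical names, and it assumes that `class_filter` keeps only rows whose label is listed and that aliases fall back to the original column names.

```python
from typing import Any, Dict, List, Optional, Union


def build_samples(
    rows: List[Dict[str, Any]],
    x_col: str = "sequence",
    y_col: str = "label",
    extra_cols: Optional[List[str]] = None,
    extra_col_aliases: Optional[List[str]] = None,
    class_filter: Union[List[int], int, None] = None,
    generate_uid: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: map raw rows to samples keyed by 'sequences' and 'labels'."""
    if isinstance(class_filter, int):
        class_filter = [class_filter]
    extra_cols = extra_cols or []
    # Assumption: aliases default to the original column names.
    aliases = extra_col_aliases or extra_cols
    samples = []
    for i, row in enumerate(rows):
        # Assumption: class_filter keeps only rows whose label is listed.
        if class_filter is not None and row[y_col] not in class_filter:
            continue
        sample = {"sequences": row[x_col], "labels": row[y_col]}
        for col, alias in zip(extra_cols, aliases):
            sample[alias] = row[col]
        if generate_uid:
            sample["uid"] = i  # Assumption: the row index serves as the unique ID.
        samples.append(sample)
    return samples


rows = [
    {"sequence": "ACGT", "label": 0, "species": "human"},
    {"sequence": "TTGA", "label": 1, "species": "mouse"},
    {"sequence": "GGCC", "label": 2, "species": "yeast"},
]
samples = build_samples(
    rows,
    extra_cols=["species"],
    extra_col_aliases=["organism"],
    class_filter=[0, 2],
)
```

Here the row with label 1 is dropped, and the `species` column is exposed under the alias `organism`.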
modelgenerator.data.TokenClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face token classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class sequence under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
extra_cols | List[str] \| optional | Additional columns to include in the dataset. Defaults to None. | None |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns. Defaults to None. | None |
max_length | int | The maximum length of the sequences. Defaults to None. | None |
pairwise | bool | Whether the labels are pairwise. Defaults to False. | False |
generate_uid | bool | Whether to generate a unique ID for each sample. Defaults to False. | False |
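
As a rough illustration of the sample layout with `max_length`, the sketch below truncates a sequence and its per-position labels together. The helper `make_token_sample` is hypothetical, and it assumes one label per sequence position and truncation from the right; the module's actual truncation behavior is not stated in this reference.

```python
from typing import Dict, List, Optional


def make_token_sample(
    sequence: str,
    labels: List[int],
    max_length: Optional[int] = None,
) -> Dict[str, object]:
    """Hypothetical helper: build one sample keyed by 'sequences' and 'labels'."""
    # Assumption: there is exactly one label per sequence position.
    if len(sequence) != len(labels):
        raise ValueError("expected one label per sequence position")
    if max_length is not None:
        # Assumption: over-long inputs are truncated from the right.
        sequence, labels = sequence[:max_length], labels[:max_length]
    return {"sequences": sequence, "labels": labels}


sample = make_token_sample("MKTAYIAK", [0, 0, 1, 1, 2, 2, 0, 0], max_length=5)
```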
modelgenerator.data.ClassDiffusionDataModule
Bases: SequenceClassificationDataModule
Data module for conditional (or class-filtered) diffusion that applies discrete diffusion noising. Inherits from SequenceClassificationDataModule.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timesteps_per_sample | int | The number of timesteps per sample, defaults to 10 | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan) | False |
batch_size | int | The batch size, defaults to 10 | 10 |
extra_cols | List[str] | Additional columns to include in the dataset, defaults to None | required |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns, defaults to None | required |
modelgenerator.data.MLMDataModule
Bases: SequenceClassificationDataModule
Data module for continuing pretraining on a masked language modeling task. Inherits from SequenceClassificationDataModule.
Note
Each sample includes a single sequence under key 'sequences' and a single target sequence under key 'target_sequences'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
masking_rate | float | The masking rate. Defaults to 0.15. | 0.15 |
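
The role of `masking_rate` can be sketched as follows. This is a minimal illustration, not the module's code: `mask_tokens`, the `"[MASK]"` token string, and the `seed` argument are hypothetical, and each token is assumed to be masked independently with probability `masking_rate`.

```python
import random
from typing import Dict, List


def mask_tokens(
    tokens: List[str],
    masking_rate: float = 0.15,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Hypothetical helper: return a masked input and its unmasked target."""
    rng = random.Random(seed)
    # Assumption: every position is masked independently with probability masking_rate.
    masked = [mask_token if rng.random() < masking_rate else tok for tok in tokens]
    return {"sequences": masked, "target_sequences": list(tokens)}


tokens = list("ACGTACGTACGTACGTACGT")
sample = mask_tokens(tokens, masking_rate=0.15)
```

The target always holds the original tokens, so a model can be trained to recover the masked positions.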
modelgenerator.data.SequenceRegressionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for sequence regression datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | union[str, list] | The name(s) of the column(s) containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | union[str, list] | The name(s) of the column(s) containing the labels. Defaults to "label". | 'label' |
extra_cols | list | Additional columns to include in the dataset. Defaults to None. | None |
extra_col_aliases | list | The names to use as aliases for the extra columns. Defaults to None. | None |
normalize | bool | Whether to normalize the labels. Defaults to True. | True |
generate_uid | bool | Whether to generate a unique ID for each sample. Defaults to False. | False |
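
This reference does not say which normalization `normalize=True` applies. One common choice is z-score standardization, sketched below purely as an assumption; `standardize` is a hypothetical helper and is not part of modelgenerator.

```python
from statistics import mean, pstdev
from typing import List


def standardize(labels: List[float]) -> List[float]:
    """Hypothetical helper: shift labels to zero mean and scale to unit variance."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        # Constant labels carry no scale information; map them all to zero.
        return [0.0 for _ in labels]
    return [(y - mu) / sigma for y in labels]


normalized = standardize([1.0, 2.0, 3.0, 4.0])
```

If labels are normalized during training, predictions need the inverse transform (multiply by the standard deviation, add the mean) to return to the original units.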
modelgenerator.data.ConditionalDiffusionDataModule
Bases: SequenceRegressionDataModule
Data module for conditional diffusion with a continuous condition that applies discrete diffusion noising. Inherits from SequenceRegressionDataModule.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timesteps_per_sample | int | The number of timesteps per sample, defaults to 10 | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan) | False |
batch_size | int | The batch size, defaults to 10 | 10 |
modelgenerator.data.DiffusionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for datasets with discrete diffusion-based noising and loss weights from MDLM https://arxiv.org/abs/2406.07524.
Notes
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The column with the data to train on, defaults to "sequence" | 'sequence' |
extra_cols | List[str] | Additional columns to include in the dataset, defaults to None | None |
extra_col_aliases | List[str] | The names to use as aliases for the extra columns, defaults to None | None |
timesteps_per_sample | int | The number of timesteps per sample, defaults to 10 | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan) | False |
batch_size | int | The batch size, defaults to 10 | 10 |
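
The sample layout described in the note can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: `noise_sample`, the `"[MASK]"` token, and the `seed` argument are hypothetical; the sketch assumes a linear masking schedule in which each token is masked independently with probability t and the loss weight is 1/t. The schedule and weights actually used by the module are not given in this reference, and `randomize_targets` is not modeled.

```python
import random
from typing import Dict, List


def noise_sample(
    tokens: List[str],
    timesteps_per_sample: int = 10,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> Dict[str, list]:
    """Hypothetical helper: build several noised copies of one sequence."""
    rng = random.Random(seed)
    inputs, targets, weights = [], [], []
    for _ in range(timesteps_per_sample):
        # Draw a noise level t in (0, 1]; a larger t masks more tokens.
        t = 1.0 - rng.random()
        # Assumption: each token is masked independently with probability t.
        noised = [mask_token if rng.random() < t else tok for tok in tokens]
        inputs.append(noised)
        targets.append(list(tokens))
        # Assumption: linear schedule, so the per-timestep weight is 1 / t.
        weights.append(1.0 / t)
    return {
        "sequences": inputs,
        "target_sequences": targets,
        "posterior_weights": weights,
    }


tokens = list("ACGTACGT")
sample = noise_sample(tokens, timesteps_per_sample=10)
```

Each entry of `sequences` pairs with the same clean target and one weight, so lightly noised copies (small t) count more per masked token.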
modelgenerator.data.CellClassificationDataModule
Bases: DataInterface
Data module for cell classification. Inherits from DataInterface.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
TODO: Add option to return a subset of genes by filtering on .
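
The interplay between `filter_columns` and `rename_columns` can be sketched on a plain dictionary. The helper `select_columns` is hypothetical, and the sketch assumes that all columns are returned unchanged when `filter_columns` is None; only the rule that renaming is ignored in that case comes from the table above.

```python
from typing import Dict, List, Optional


def select_columns(
    record: Dict[str, object],
    filter_columns: Optional[List[str]] = None,
    rename_columns: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Hypothetical helper: keep selected metadata columns and optionally rename them."""
    if filter_columns is None:
        # rename_columns does nothing without filter_columns, as documented.
        # Assumption: every column is returned in this case.
        return dict(record)
    names = rename_columns if rename_columns is not None else filter_columns
    if len(names) != len(filter_columns):
        raise ValueError("rename_columns must match filter_columns in length")
    return {new: record[old] for old, new in zip(filter_columns, names)}


meta = {"cell_type": "T cell", "donor": "D1", "batch": 3}
kept = select_columns(meta, ["cell_type"], ["labels"])
unchanged = select_columns(meta, None, ["labels"])
```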
modelgenerator.data.ClockDataModule
Bases: DataInterface
Data module for transcriptomic clock tasks. Inherits from DataInterface.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
split_column | str | The column of | required |
gene_set_file | str | Path to a csv file containing gene symbols in the order expected by the model being used. | required |
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
TODO: Add option to return a subset of genes by filtering on .
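
To show why the gene order in `gene_set_file` matters, the sketch below aligns one cell's expression values to a fixed gene order. It is a simplified illustration: `read_gene_order` and `align_expression` are hypothetical helpers, the file is assumed to hold one gene symbol per row with no header, and genes missing from the data are assumed to be filled with 0.0.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_gene_order(handle: TextIO) -> List[str]:
    """Hypothetical helper: read one gene symbol per row (assumes no header row)."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def align_expression(
    expression: Dict[str, float],
    gene_order: List[str],
    fill_value: float = 0.0,
) -> List[float]:
    """Hypothetical helper: order values by gene_order, filling missing genes."""
    # Assumption: genes absent from the data get fill_value; extra genes are dropped.
    return [expression.get(gene, fill_value) for gene in gene_order]


# An in-memory stand-in for the csv file on disk.
gene_file = io.StringIO("TP53\nBRCA1\nMYC\n")
gene_order = read_gene_order(gene_file)
vector = align_expression({"MYC": 2.5, "TP53": 0.7, "GAPDH": 9.1}, gene_order)
```

The resulting feature vector has one entry per gene in the file, in file order, regardless of how the input data was ordered.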
modelgenerator.data.PertClassificationDataModule
Bases: DataInterface
Data module for perturbation classification. Inherits from DataInterface.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
gene_set_file | str | Path to a csv file containing gene symbols in the order expected by the model being used. | required |
pert_column | str | Column of | required |
cell_line_column | str | Column of | required |
cell_line | str | Name of cell line to consider. | required |
split_seed | int | Seed for train/val/test splits. | 1234 |
train_frac | float | Fraction of examples to assign to train set. | 0.7 |
val_frac | float | Fraction of examples to assign to val set. | 0.15 |
test_frac | float | Fraction of examples to assign to test set. | 0.15 |
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |
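
The split parameters above can be illustrated with a seeded shuffle. This is a minimal sketch rather than the module's own splitting logic: `split_indices` is a hypothetical helper, and it assumes a single random permutation of example indices that is cut by the three fractions.

```python
import random
from typing import Dict, List


def split_indices(
    n: int,
    train_frac: float = 0.7,
    val_frac: float = 0.15,
    test_frac: float = 0.15,
    split_seed: int = 1234,
) -> Dict[str, List[int]]:
    """Hypothetical helper: reproducible train/val/test split of n examples."""
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-8:
        raise ValueError("train_frac, val_frac and test_frac must sum to 1")
    indices = list(range(n))
    # A dedicated Random instance keeps the split independent of global RNG state.
    random.Random(split_seed).shuffle(indices)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        # The test set takes the remainder so every example is assigned once.
        "test": indices[n_train + n_val:],
    }


splits = split_indices(100)
```

Reusing the same `split_seed` reproduces the same partition, which keeps evaluation comparable across runs.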