Data
modelgenerator.data.SequenceClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face sequence classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single class label under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | str \| List[str] | The name of the column(s) containing the labels. Defaults to "label". | 'label' |
class_filter | List[int] \| int | The class(es) to filter. Defaults to None. | None |
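
The column mapping described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `to_samples` is a hypothetical helper, the rows are made-up data, and it assumes `class_filter` keeps only rows whose label is listed (the source does not say whether matching classes are kept or dropped).

```python
from typing import Dict, List, Optional, Union


def to_samples(
    rows: List[Dict],
    x_col: str = "sequence",
    y_col: str = "label",
    class_filter: Optional[Union[int, List[int]]] = None,
) -> List[Dict]:
    """Map raw rows to {'sequences', 'labels'} samples (hypothetical helper)."""
    if isinstance(class_filter, int):
        class_filter = [class_filter]
    samples = []
    for row in rows:
        # Assumption: class_filter keeps only rows whose label is listed.
        if class_filter is not None and row[y_col] not in class_filter:
            continue
        samples.append({"sequences": row[x_col], "labels": row[y_col]})
    return samples


# Made-up rows standing in for a dataset split.
rows = [
    {"sequence": "ACGT", "label": 0},
    {"sequence": "GGCA", "label": 1},
    {"sequence": "TTAG", "label": 0},
]
filtered = to_samples(rows, class_filter=0)
```

With `class_filter=0`, only the two rows labelled 0 remain under this assumption; with the default `None`, every row is kept.
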
modelgenerator.data.TokenClassificationDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for Hugging Face token classification datasets.
Note
Each sample includes a single sequence under key 'sequences' and a single sequence of class labels under key 'labels'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
max_length | int | The maximum length of the sequences. Defaults to None. | None |
pairwise | bool | Whether the labels are pairwise. Defaults to False. | False |
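
A minimal sketch of the per-token sample layout follows. It is not the library's code: `to_token_sample` is a hypothetical helper, one character is treated as one token, and `max_length` is assumed to truncate (the source only says it is the maximum sequence length, so filtering is equally possible). The `pairwise` option is not modelled.

```python
from typing import Dict, List, Optional


def to_token_sample(
    sequence: str,
    labels: List[int],
    max_length: Optional[int] = None,
) -> Dict:
    """Build a {'sequences', 'labels'} sample with one label per token."""
    if len(sequence) != len(labels):
        raise ValueError("expected exactly one label per token")
    if max_length is not None:
        # Assumption: longer sequences are truncated, labels alongside them.
        sequence, labels = sequence[:max_length], labels[:max_length]
    return {"sequences": sequence, "labels": labels}


sample = to_token_sample("ACGUAC", [0, 1, 1, 0, 0, 1], max_length=4)
```

The sequence and its labels are cut together so that they stay aligned position by position.
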
modelgenerator.data.ClassDiffusionDataModule
Bases: SequenceClassificationDataModule
Data module for conditional (or class-filtered) diffusion that applies discrete diffusion noising. Inherits from SequenceClassificationDataModule.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). Defaults to False. | False |
batch_size | int | The batch size. Defaults to 10. | 10 |
modelgenerator.data.MLMDataModule
Bases: SequenceClassificationDataModule
Data module for continuing pretraining on a masked language modeling task. Inherits from SequenceClassificationDataModule.
Note
Each sample includes a single sequence under key 'sequences' and a single target sequence under key 'target_sequences'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
masking_rate | float | The masking rate. Defaults to 0.15. | 0.15 |
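
To make the role of `masking_rate` concrete, here is a small sketch of random token masking. It is an assumption-laden illustration, not the module's implementation: `mask_sequence`, the `[MASK]` string, and character-level tokens are all hypothetical, and only plain masking is shown (no random-token or keep-original replacement).

```python
import random
from typing import Dict

MASK = "[MASK]"  # hypothetical mask token; a real tokenizer defines its own


def mask_sequence(sequence: str, masking_rate: float = 0.15, seed: int = 0) -> Dict:
    """Mask each token independently with probability `masking_rate`."""
    rng = random.Random(seed)
    tokens = list(sequence)  # assumption: one character per token
    masked = [MASK if rng.random() < masking_rate else tok for tok in tokens]
    # The unmasked tokens are kept as the reconstruction target.
    return {"sequences": masked, "target_sequences": tokens}


sample = mask_sequence("ACGTACGTACGTACGTACGT")
```

Every position in `sample["sequences"]` is either the mask token or the original token, and `sample["target_sequences"]` always holds the original.
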
modelgenerator.data.SequenceRegressionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for sequence regression datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
normalize | bool | Whether to normalize the labels. Defaults to True. | True |
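
The source does not say how `normalize` rescales the labels. As one plausible reading, the sketch below applies z-score normalization (zero mean, unit variance); `normalize_labels` is a hypothetical helper and the library may use a different scheme or different statistics.

```python
from statistics import mean, pstdev
from typing import List


def normalize_labels(labels: List[float]) -> List[float]:
    """Z-score the labels (assumed scheme, not confirmed by the source)."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        # Constant labels carry no scale; map them all to zero.
        return [0.0 for _ in labels]
    return [(y - mu) / sigma for y in labels]


normalized = normalize_labels([1.0, 2.0, 3.0, 4.0])
```

Predictions made against normalized labels would need the same mean and standard deviation to be mapped back to the original scale.
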
modelgenerator.data.ConditionalDiffusionDataModule
Bases: SequenceRegressionDataModule
Data module for conditional diffusion with a continuous condition that applies discrete diffusion noising. Inherits from SequenceRegressionDataModule.
Note
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). Defaults to False. | False |
batch_size | int | The batch size. Defaults to 10. | 10 |
modelgenerator.data.DiffusionDataModule
Bases: DataInterface, HFDatasetLoaderMixin
Data module for datasets with discrete diffusion-based noising and loss weights from MDLM https://arxiv.org/abs/2406.07524.
Notes
Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_col | str | The column with the data to train on. Defaults to "sequence". | 'sequence' |
timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). Defaults to False. | False |
batch_size | int | The batch size. Defaults to 10. | 10 |
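
The sample layout in the notes above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: `noise_sample` and the `[MASK]` string are hypothetical, tokens are single characters, noising is absorbing-state masking with probability t at noise level t, and the weight 1/t is an assumed choice corresponding to a linear masking schedule.

```python
import random
from typing import Dict, List

MASK = "[MASK]"  # hypothetical absorbing-state token


def noise_sample(
    sequence: str,
    timesteps_per_sample: int = 10,
    seed: int = 0,
) -> Dict[str, List]:
    """Produce `timesteps_per_sample` noised copies of one sequence."""
    rng = random.Random(seed)
    tokens = list(sequence)  # assumption: one character per token
    inputs, targets, weights = [], [], []
    for _ in range(timesteps_per_sample):
        # Noise level t in (0, 1], so 1 / t is always defined.
        t = 1.0 - rng.random()
        # Each token is replaced by the mask token with probability t.
        noised = [MASK if rng.random() < t else tok for tok in tokens]
        inputs.append(noised)
        targets.append(list(tokens))
        # Assumed loss weight for a linear schedule.
        weights.append(1.0 / t)
    return {
        "sequences": inputs,
        "target_sequences": targets,
        "posterior_weights": weights,
    }


sample = noise_sample("ACGTACGT", timesteps_per_sample=4)
```

Each of the three lists has one entry per sampled timestep, so a single source sequence expands into `timesteps_per_sample` training examples.
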
modelgenerator.data.CellClassificationDataModule
Bases: DataInterface
Data module for cell classification. Inherits from DataInterface.
Note
Each sample includes a feature vector (one of the rows in
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filter_columns | Optional[list[str]] | The columns of | None |
rename_columns | Optional[list[str]] | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |

TODO: Add option to return a subset of genes by filtering on .
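
The interaction between `filter_columns` and `rename_columns` can be sketched with a plain dictionary. This is a guess at the behaviour, not the library's code: `select_columns` is a hypothetical helper, the source does not say which table the columns belong to, and new names are assumed to pair positionally with the filtered columns.

```python
from typing import Dict, List, Optional


def select_columns(
    row: Dict,
    filter_columns: Optional[List[str]] = None,
    rename_columns: Optional[List[str]] = None,
) -> Dict:
    """Keep only `filter_columns`, optionally renaming them in order."""
    if filter_columns is None:
        # As documented, rename_columns does nothing without filter_columns.
        return dict(row)
    names = rename_columns if rename_columns is not None else filter_columns
    if len(names) != len(filter_columns):
        raise ValueError("rename_columns must match filter_columns in length")
    # Assumption: the i-th new name applies to the i-th filtered column.
    return {new: row[old] for old, new in zip(filter_columns, names)}


# Made-up metadata row for illustration.
row = {"cell_type": "T cell", "donor": "d1", "batch": 3}
selected = select_columns(row, ["cell_type"], ["labels"])
```

Passing `rename_columns` without `filter_columns` returns the row unchanged, matching the documented "does nothing" case.
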