Data

modelgenerator.data.SequenceClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face sequence classification datasets.

Note

Each sample includes a single sequence under the key 'sequences' and a single class label under the key 'labels'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
| y_col | str \| List[str] | The name of the column(s) containing the labels. Defaults to "label". | 'label' |
| class_filter | List[int] \| int | The class to filter. Defaults to None. | None |
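
The effect of class_filter can be pictured with a minimal sketch in plain Python. It does not call modelgenerator: `filter_by_class` and the toy `rows` are hypothetical, and the sketch assumes the filter keeps only the listed classes, which the description above does not state explicitly.

```python
from typing import Dict, List, Optional, Union


def filter_by_class(
    rows: List[Dict],
    x_col: str = "sequence",
    y_col: str = "label",
    class_filter: Optional[Union[int, List[int]]] = None,
) -> List[Dict]:
    """Hypothetical helper: keep rows whose label is in class_filter.

    Assumption: the filter keeps the listed classes. The emitted samples
    use the 'sequences' and 'labels' keys described in the Note.
    """
    if class_filter is None:
        keep = None  # no filtering
    elif isinstance(class_filter, int):
        keep = {class_filter}
    else:
        keep = set(class_filter)

    samples = []
    for row in rows:
        if keep is not None and row[y_col] not in keep:
            continue
        samples.append({"sequences": row[x_col], "labels": row[y_col]})
    return samples


# Toy data standing in for a Hugging Face dataset split.
rows = [
    {"sequence": "ACGT", "label": 0},
    {"sequence": "GGCA", "label": 1},
    {"sequence": "TTAG", "label": 2},
]
kept = filter_by_class(rows, class_filter=[0, 2])
```

Passing a single int behaves like a one-element list, and leaving class_filter as None returns every row.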

modelgenerator.data.TokenClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face token classification datasets.

Note

Each sample includes a single sequence under the key 'sequences' and a single sequence of class labels under the key 'labels'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
| y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
| max_length | int | The maximum length of the sequences. Defaults to None. | None |
| pairwise | bool | Whether the labels are pairwise. Defaults to False. | False |
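
A rough sketch of how max_length and pairwise might interact is given below. It is not the library's implementation: `make_token_sample` is a hypothetical helper, truncation is an assumed reading of "maximum length", and "pairwise" is assumed to mean one label per pair of positions (an L x L matrix, as in a contact map).

```python
from typing import Dict, List, Optional


def make_token_sample(
    sequence: str,
    labels: List,
    max_length: Optional[int] = None,
    pairwise: bool = False,
) -> Dict:
    """Hypothetical helper: truncate a sequence and its per-token labels.

    Assumptions: max_length truncates from the right, and pairwise labels
    form a square matrix that is truncated along both axes.
    """
    if max_length is not None:
        sequence = sequence[:max_length]
        if pairwise:
            labels = [row[:max_length] for row in labels[:max_length]]
        else:
            labels = labels[:max_length]
    return {"sequences": sequence, "labels": labels}


# One label per token.
per_token = make_token_sample("ACGU", [0, 1, 1, 0], max_length=3)

# One label per pair of tokens (assumed meaning of pairwise=True).
contact = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
per_pair = make_token_sample("ACGU", contact, max_length=3, pairwise=True)
```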

modelgenerator.data.ClassDiffusionDataModule

Bases: SequenceClassificationDataModule

Data module for conditional (or class-filtered) diffusion that applies discrete diffusion noising. Inherits from SequenceClassificationDataModule.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and the posterior weights are under 'posterior_weights'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). | False |
| batch_size | int | The batch size. Defaults to 10. | 10 |

modelgenerator.data.MLMDataModule

Bases: SequenceClassificationDataModule

Data module for continuing pretraining on a masked language modeling task. Inherits from SequenceClassificationDataModule.

Note

Each sample includes a single sequence under the key 'sequences' and a single target sequence under the key 'target_sequences'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| masking_rate | float | The masking rate. Defaults to 0.15. | 0.15 |
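
To illustrate what masking_rate controls, here is a minimal sketch of random token masking in plain Python. `mask_tokens`, the `"[MASK]"` string, and the `seed` argument are hypothetical, and the sketch assumes each token is masked independently with probability masking_rate; the module's actual masking strategy may differ.

```python
import random
from typing import Dict, List, Optional


def mask_tokens(
    tokens: List[str],
    masking_rate: float = 0.15,
    mask_token: str = "[MASK]",
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: mask each token independently.

    The masked input goes under 'sequences' and the untouched original
    under 'target_sequences', matching the keys in the Note.
    """
    rng = random.Random(seed)
    masked = [
        mask_token if rng.random() < masking_rate else tok
        for tok in tokens
    ]
    return {"sequences": masked, "target_sequences": list(tokens)}


tokens = list("ACGT" * 50)
sample = mask_tokens(tokens, masking_rate=0.15, seed=0)
```

With this scheme the expected fraction of masked positions equals masking_rate, while the exact count varies from sample to sample.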

modelgenerator.data.SequenceRegressionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for sequence regression datasets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x_col | str | The name of the column containing the sequences. Defaults to "sequence". | 'sequence' |
| y_col | str | The name of the column containing the labels. Defaults to "label". | 'label' |
| normalize | bool | Whether to normalize the labels. Defaults to True. | True |
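
The page does not say which normalization the normalize flag applies. One common choice is standardization to zero mean and unit variance, sketched below with a hypothetical `normalize_labels` helper; treat it as an illustration of the idea rather than the module's behavior.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize_labels(labels: List[float]) -> Tuple[List[float], float, float]:
    """Hypothetical helper: z-score labels (assumed normalization scheme).

    Returns the normalized labels plus the mean and standard deviation,
    which are needed to map predictions back to the original scale.
    """
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        # Constant labels: avoid division by zero.
        return [0.0 for _ in labels], mu, sigma
    return [(y - mu) / sigma for y in labels], mu, sigma


normed, mu, sigma = normalize_labels([1.0, 2.0, 3.0])

# Undo the transform to recover the original labels.
restored = [z * sigma + mu for z in normed]
```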

modelgenerator.data.ConditionalDiffusionDataModule

Bases: SequenceRegressionDataModule

Data module for conditional diffusion with a continuous condition that applies discrete diffusion noising. Inherits from SequenceRegressionDataModule.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and the posterior weights are under 'posterior_weights'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). | False |
| batch_size | int | The batch size. Defaults to 10. | 10 |

modelgenerator.data.DiffusionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for datasets with discrete diffusion-based noising and loss weights from MDLM https://arxiv.org/abs/2406.07524.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and the posterior weights are under 'posterior_weights'.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x_col | str | The column with the data to train on. Defaults to "sequence". | 'sequence' |
| timesteps_per_sample | int | The number of timesteps per sample. Defaults to 10. | 10 |
| randomize_targets | bool | Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan). | False |
| batch_size | int | The batch size. Defaults to 10. | 10 |
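
The sample layout described in the Note can be sketched as follows. This is a simplified illustration, not the module's code: `noise_sample`, the `"[MASK]"` string, and `seed` are hypothetical, and the sketch assumes a linear masking schedule in which each token is masked with probability t at noise level t and the loss weight is 1/t. The library's actual schedule and weights may differ.

```python
import random
from typing import Dict, List, Optional


def noise_sample(
    tokens: List[str],
    timesteps_per_sample: int = 10,
    mask_token: str = "[MASK]",
    seed: Optional[int] = None,
) -> Dict[str, list]:
    """Hypothetical helper: build noised copies of one sequence.

    Assumptions: noise level t is drawn uniformly, each token is masked
    with probability t, and the per-copy weight is 1 / t.
    """
    rng = random.Random(seed)
    sequences, targets, weights = [], [], []
    for _ in range(timesteps_per_sample):
        t = rng.uniform(1e-3, 1.0)  # lower bound keeps 1 / t finite
        noised = [mask_token if rng.random() < t else tok for tok in tokens]
        sequences.append(noised)
        targets.append(list(tokens))
        weights.append(1.0 / t)
    return {
        "sequences": sequences,
        "target_sequences": targets,
        "posterior_weights": weights,
    }


example = noise_sample(list("ACGTACGT"), timesteps_per_sample=10, seed=0)
```

Each of the timesteps_per_sample copies shares the same clean target, so one dataset row expands into several training examples at different noise levels.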

modelgenerator.data.CellClassificationDataModule

Bases: DataInterface

Data module for cell classification. Inherits from BaseDataModule.

Note

Each sample includes a feature vector (one of the rows in ) and a single class label (one of the columns in )

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| filter_columns | Optional[list[str]] | The columns of we want to use. Defaults to None, in which case all columns are used. | None |
| rename_columns | Optional[list[str]] | New name of columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None. | None |

TODO: Add option to return a subset of genes by filtering on .
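
The interplay between filter_columns and rename_columns can be sketched in plain Python. The object whose columns are selected is elided in the text above, so a plain dict of column lists stands in for it here. `select_columns` and the column names are hypothetical, and the sketch assumes rename_columns is matched to filter_columns by position.

```python
from typing import Dict, List, Optional


def select_columns(
    table: Dict[str, list],
    filter_columns: Optional[List[str]] = None,
    rename_columns: Optional[List[str]] = None,
) -> Dict[str, list]:
    """Hypothetical helper: select and optionally rename columns.

    Mirrors the documented behavior: with filter_columns=None all columns
    are kept and rename_columns is ignored.
    """
    if filter_columns is None:
        return dict(table)
    names = rename_columns if rename_columns is not None else filter_columns
    if len(names) != len(filter_columns):
        raise ValueError("rename_columns must match filter_columns in length")
    return {new: table[old] for old, new in zip(filter_columns, names)}


# Toy table with made-up column names.
table = {
    "cell_type": ["T", "B"],
    "tissue": ["blood", "spleen"],
    "batch": [1, 2],
}
selected = select_columns(
    table,
    filter_columns=["cell_type", "tissue"],
    rename_columns=["label", "organ"],
)
unchanged = select_columns(table, rename_columns=["a", "b", "c"])
```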