Data

modelgenerator.data.SequenceClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face sequence classification datasets.

Note

Each sample includes a single sequence under the key 'sequences' and a single class label under the key 'labels'.

Parameters:

Name Type Description Default
x_col str

The name of the column containing the sequences. Defaults to "sequence".

'sequence'
y_col str | List[str]

The name of the column(s) containing the labels. Defaults to "label".

'label'
extra_cols List[str] | optional

Additional columns to include in the dataset. Defaults to None.

None
extra_col_aliases List[str]

The names to use as aliases for the extra columns. Defaults to None.

None
class_filter List[int] | int

The class or classes to filter on. Defaults to None.

None
generate_uid bool

Whether to generate a unique ID for each sample. Defaults to False.

False
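
The column mapping and class filter described above can be pictured with plain Python. This is a minimal sketch, not the library's implementation: `rows`, `to_samples`, and the reading of `class_filter` as "keep only the listed classes" are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Union

# Hypothetical raw rows, standing in for one split of a Hugging Face dataset.
rows = [
    {"sequence": "ACGT", "label": 0},
    {"sequence": "GGCA", "label": 1},
    {"sequence": "TTAG", "label": 2},
]


def to_samples(
    rows: List[Dict],
    x_col: str = "sequence",
    y_col: str = "label",
    class_filter: Optional[Union[int, List[int]]] = None,
) -> List[Dict]:
    """Map raw rows to samples keyed by 'sequences' and 'labels'.

    Assumption: class_filter keeps only rows whose label is listed.
    """
    if isinstance(class_filter, int):
        class_filter = [class_filter]
    samples = []
    for row in rows:
        if class_filter is not None and row[y_col] not in class_filter:
            continue  # drop rows outside the requested classes
        samples.append({"sequences": row[x_col], "labels": row[y_col]})
    return samples


samples = to_samples(rows, class_filter=[0, 2])
```

Here `samples` holds the two rows labeled 0 and 2, each re-keyed to 'sequences' and 'labels'.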

modelgenerator.data.TokenClassificationDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for Hugging Face token classification datasets.

Note

Each sample includes a single sequence under the key 'sequences' and a single sequence of class labels under the key 'labels'.

Parameters:

Name Type Description Default
x_col str

The name of the column containing the sequences. Defaults to "sequence".

'sequence'
y_col str

The name of the column containing the labels. Defaults to "label".

'label'
extra_cols List[str] | optional

Additional columns to include in the dataset. Defaults to None.

None
extra_col_aliases List[str]

The names to use as aliases for the extra columns. Defaults to None.

None
max_length int

The maximum length of the sequences. Defaults to None.

None
pairwise bool

Whether the labels are pairwise. Defaults to False.

False
generate_uid bool

Whether to generate a unique ID for each sample. Defaults to False.

False
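
In token classification every token carries its own label, so a length limit has to be applied to the sequence and its labels together. The sketch below illustrates that idea only. It assumes one character per token and that `max_length` truncates long sequences rather than dropping them; `truncate_sample` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Optional


def truncate_sample(
    sequence: str, labels: List[int], max_length: Optional[int] = None
) -> Dict:
    """Cut a sequence and its per-token labels to the same length."""
    if len(sequence) != len(labels):
        raise ValueError("expected exactly one label per token")
    if max_length is not None:
        # Slice both with the same bound so tokens and labels stay aligned.
        sequence, labels = sequence[:max_length], labels[:max_length]
    return {"sequences": sequence, "labels": labels}


sample = truncate_sample("ACGTAC", [0, 1, 1, 0, 2, 2], max_length=4)
```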

modelgenerator.data.ClassDiffusionDataModule

Bases: SequenceClassificationDataModule

Data module for conditional (or class-filtered) diffusion that applies discrete diffusion noising. Inherits from SequenceClassificationDataModule.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

Name Type Description Default
timesteps_per_sample int

The number of timesteps per sample, defaults to 10

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan)

False
batch_size int

The batch size, defaults to 10

10
extra_cols List[str]

Additional columns to include in the dataset, defaults to None

required
extra_col_aliases List[str]

The names to use as aliases for the extra columns, defaults to None

required

modelgenerator.data.MLMDataModule

Bases: SequenceClassificationDataModule

Data module for continuing pretraining on a masked language modeling task. Inherits from SequenceClassificationDataModule.

Note

Each sample includes a single sequence under the key 'sequences' and a single target sequence under the key 'target_sequences'.

Parameters:

Name Type Description Default
masking_rate float

The masking rate. Defaults to 0.15.

0.15
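
To make the role of `masking_rate` concrete, here is a minimal sketch of random token masking. It is not the library's masking code: the `[MASK]` string, the per-token independent masking, the `seed` argument, and the `mask_tokens` helper are all assumptions for illustration.

```python
import random
from typing import Dict, List


def mask_tokens(
    tokens: List[str],
    masking_rate: float = 0.15,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Mask each token independently with probability masking_rate."""
    rng = random.Random(seed)
    masked = [mask_token if rng.random() < masking_rate else tok for tok in tokens]
    # The unmasked original is kept as the prediction target.
    return {"sequences": masked, "target_sequences": list(tokens)}


tokens = list("ACGTACGTACGTACGTACGT")
sample = mask_tokens(tokens)
```

The output mirrors the keys named in the note above: 'sequences' holds the masked input and 'target_sequences' the original tokens.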

modelgenerator.data.SequenceRegressionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for sequence regression datasets.

Parameters:

Name Type Description Default
x_col union[str, list]

The name(s) of the column(s) containing the sequences. Defaults to "sequence".

'sequence'
y_col union[str, list]

The name(s) of the column(s) containing the labels. Defaults to "label".

'label'
extra_cols list

Additional columns to include in the dataset. Defaults to None.

None
extra_col_aliases list

The names to use as aliases for the extra columns. Defaults to None.

None
normalize bool

Whether to normalize the labels. Defaults to True.

True
generate_uid bool

Whether to generate a unique ID for each sample. Defaults to False.

False
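
The `normalize` option rescales regression labels before training. The page does not say which scheme is used, so the sketch below assumes standard z-score normalization (subtract the mean, divide by the standard deviation). `normalize_labels` is a hypothetical helper, not the library's code.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize_labels(labels: List[float]) -> Tuple[List[float], float, float]:
    """Z-score labels; returns the scaled labels plus the mean and std used."""
    mu = mean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        sigma = 1.0  # constant labels: avoid dividing by zero
    return [(y - mu) / sigma for y in labels], mu, sigma


normalized, mu, sigma = normalize_labels([1.0, 2.0, 3.0, 4.0])
```

Keeping `mu` and `sigma` allows predictions to be mapped back to the original label scale with `y * sigma + mu`.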

modelgenerator.data.ConditionalDiffusionDataModule

Bases: SequenceRegressionDataModule

Data module for conditional diffusion with a continuous condition that applies discrete diffusion noising. Inherits from SequenceRegressionDataModule.

Note

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_seqs', the input sequences are under 'input_seqs', and posterior weights are under 'posterior_weights'.

Parameters:

Name Type Description Default
timesteps_per_sample int

The number of timesteps per sample, defaults to 10

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan)

False
batch_size int

The batch size, defaults to 10

10

modelgenerator.data.DiffusionDataModule

Bases: DataInterface, HFDatasetLoaderMixin

Data module for datasets with discrete diffusion-based noising and loss weights from MDLM (https://arxiv.org/abs/2406.07524).

Notes

Each sample includes timesteps_per_sample sequences at different noise levels. Each sample's target sequences are under 'target_sequences', the input sequences are under 'sequences', and posterior weights are under 'posterior_weights'.

Parameters:

Name Type Description Default
x_col str

The column with the data to train on, defaults to "sequence"

'sequence'
extra_cols List[str]

Additional columns to include in the dataset, defaults to None

None
extra_col_aliases List[str]

The names to use as aliases for the extra columns, defaults to None

None
timesteps_per_sample int

The number of timesteps per sample, defaults to 10

10
randomize_targets bool

Whether to randomize the target sequences for each timestep (experimental efficiency boost proposed by Sazan)

False
batch_size int

The batch size, defaults to 10

10
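
The sample layout described in the note above (several noised copies of one sequence, each with a weight) can be sketched as follows. This is an illustration under stated assumptions, not the library's noising code: the noise level `t` is drawn uniformly from (0, 1], each token is masked independently with probability `t`, and the weight is taken as `1 / t`, which matches a linear masking schedule. The `[MASK]` string, `seed`, and `noise_sample` are hypothetical.

```python
import random
from typing import Dict, List


def noise_sample(
    tokens: List[str],
    timesteps_per_sample: int = 10,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> Dict[str, list]:
    """Create several masked copies of one sequence at different noise levels."""
    rng = random.Random(seed)
    inputs, targets, weights = [], [], []
    for _ in range(timesteps_per_sample):
        t = 1.0 - rng.random()  # in (0, 1], so 1 / t is always defined
        noised = [mask_token if rng.random() < t else tok for tok in tokens]
        inputs.append(noised)
        targets.append(list(tokens))  # the clean sequence is the target
        weights.append(1.0 / t)       # assumed weight for a linear schedule
    return {
        "sequences": inputs,
        "target_sequences": targets,
        "posterior_weights": weights,
    }


tokens = list("ACGTACGT")
sample = noise_sample(tokens, timesteps_per_sample=4)
```

Low noise levels mask few tokens and receive large weights, while `t` near 1 masks almost everything and receives a weight near 1.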

modelgenerator.data.CellClassificationDataModule

Bases: DataInterface

Data module for cell classification. Inherits from BaseDataModule.

Note

Each sample includes a feature vector (one of the rows in ) and a single class label (one of the columns in )

Parameters:

Name Type Description Default
filter_columns Optional[list[str]]

The columns of we want to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
TODO: Add option to return a subset of genes by filtering on .
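
The interaction between `filter_columns` and `rename_columns` can be sketched on a plain dictionary of columns. This is a minimal illustration, not the library's code: `table`, `select_columns`, and the positional pairing of old and new names are assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical column store standing in for the annotation table.
table = {
    "cell_type": ["T cell", "B cell"],
    "donor": ["d1", "d2"],
    "batch": [1, 2],
}


def select_columns(
    table: Dict[str, list],
    filter_columns: Optional[List[str]] = None,
    rename_columns: Optional[List[str]] = None,
) -> Dict[str, list]:
    """Keep the requested columns and optionally rename them by position."""
    if filter_columns is None:
        # All columns are used, and rename_columns is ignored.
        return dict(table)
    names = rename_columns if rename_columns is not None else filter_columns
    if len(names) != len(filter_columns):
        raise ValueError("rename_columns must match filter_columns in length")
    return {new: table[old] for old, new in zip(filter_columns, names)}


selected = select_columns(table, ["cell_type"], ["labels"])
```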

modelgenerator.data.ClockDataModule

Bases: DataInterface

Data module for transcriptomic clock tasks. Inherits from BaseDataModule.

Note

Each sample includes a feature vector (one of the rows in ) and a single scalar corresponding to donor age (one of the columns in )

Parameters:

Name Type Description Default
split_column str

The column of that defines the split assignments.

required
gene_set_file str

Path to a csv file containing gene symbols in the order expected by the model being used.

required
filter_columns Optional[list[str]]

The columns of we want to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
TODO: Add option to return a subset of genes by filtering on .
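
The `gene_set_file` parameter fixes the order of genes the model expects, so each expression vector must be arranged in that order. The sketch below illustrates this with an in-memory CSV. The file layout (one symbol per row, no header), filling missing genes with 0.0, and both helper functions are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_gene_order(handle: TextIO) -> List[str]:
    """Read gene symbols, one per row, preserving file order."""
    return [row[0] for row in csv.reader(handle) if row]


def align_to_gene_set(
    expression: Dict[str, float], gene_order: List[str], fill_value: float = 0.0
) -> List[float]:
    """Arrange expression values in the model's gene order."""
    return [expression.get(gene, fill_value) for gene in gene_order]


# In practice this would be open(gene_set_file); StringIO keeps the sketch self-contained.
gene_file = io.StringIO("TP53\nBRCA1\nMYC\n")
gene_order = load_gene_order(gene_file)
vector = align_to_gene_set({"MYC": 2.0, "TP53": 0.5}, gene_order)
```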

modelgenerator.data.PertClassificationDataModule

Bases: DataInterface

Data module for perturbation classification. Inherits from BaseDataModule.

Note

Each sample includes a feature vector (one of the rows in ) and a single class label (one of the columns in )

Parameters:

Name Type Description Default
gene_set_file str

Path to a csv file containing gene symbols in the order expected by the model being used.

required
pert_column str

Column of containing perturbation labels.

required
cell_line_column str

Column of containing cell line labels.

required
cell_line str

Name of cell line to consider.

required
split_seed int

Seed for train/val/test splits.

1234
train_frac float

Fraction of examples to assign to train set.

0.7
val_frac float

Fraction of examples to assign to val set.

0.15
test_frac float

Fraction of examples to assign to test set.

0.15
filter_columns Optional[list[str]]

The columns of we want to use. Defaults to None, in which case all columns are used.

None
rename_columns Optional[list[str]]

New names for the columns. Defaults to None, in which case columns are not renamed. Does nothing if filter_columns is None.

None
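
The split parameters above (`split_seed`, `train_frac`, `val_frac`, `test_frac`) describe a seeded random partition. A minimal sketch of such a partition follows. It is not the library's splitting code: the shuffle-then-slice strategy, the requirement that fractions sum to 1, and the `split_indices` helper are assumptions.

```python
import random
from typing import Dict, List


def split_indices(
    n: int,
    split_seed: int = 1234,
    train_frac: float = 0.7,
    val_frac: float = 0.15,
    test_frac: float = 0.15,
) -> Dict[str, List[int]]:
    """Shuffle indices with a fixed seed and slice them into three sets."""
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(split_seed).shuffle(indices)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder goes to test
    }


splits = split_indices(100)
```

Because the shuffle is seeded, calling `split_indices` again with the same `split_seed` reproduces the same assignment.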