K-fold cross validation
Datasets that implement the DataInterface with the KFoldMixin support semi-automatic k-fold cross-validation for uncertainty estimation.
We use translation efficiency prediction as an example task to demonstrate k-fold cross-validation in ModelGenerator. The dataset is split into k folds, and each fold serves as the test set in turn.
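The fold rotation described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not ModelGenerator's implementation: the helper name split_by_fold is hypothetical, and the rule for picking the validation fold (the fold after the test fold, wrapping around) is an assumption made for this sketch.

```python
from typing import Dict, List, Sequence


def split_by_fold(
    fold_ids: Sequence[int],
    num_folds: int,
    test_fold_id: int,
    enable_val_fold: bool = True,
) -> Dict[str, List[int]]:
    """Return sample indices for train/val/test given per-sample fold ids."""
    if not 0 <= test_fold_id < num_folds:
        raise ValueError("test_fold_id must be in [0, num_folds)")
    # ASSUMPTION (not confirmed by the source): the validation fold is the
    # fold right after the test fold, wrapping around at num_folds.
    val_fold_id = (test_fold_id + 1) % num_folds if enable_val_fold else None
    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for idx, fold in enumerate(fold_ids):
        if fold == test_fold_id:
            splits["test"].append(idx)
        elif fold == val_fold_id:
            splits["val"].append(idx)
        else:
            splits["train"].append(idx)
    return splits


# Rotate the test fold over a toy dataset of 20 samples in 5 folds.
fold_ids = [i % 5 for i in range(20)]
for test_fold in range(5):
    parts = split_by_fold(fold_ids, num_folds=5, test_fold_id=test_fold)
    print(test_fold, len(parts["train"]), len(parts["val"]), len(parts["test"]))
```

Each pass holds out a different fold, so every sample is used for testing exactly once across the k runs.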
Data configs
For a cross-validation task, we provide only one dataset, named train, which contains a column fold_id indicating the fold index of each sample. Set cv_num_folds, cv_test_fold_id, cv_enable_val_fold, and cv_fold_id_col according to your experiment settings.
data:
  class_path: modelgenerator.data.TranslationEfficiency
  init_args:
    path: genbio-ai/rna-downstream-tasks
    config_name: translation_efficiency_Muscle
    normalize: true
    train_split_name: train
    random_seed: 42
    batch_size: 8
    shuffle: true
    cv_num_folds: 10
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_fold_id_col: fold_id
See experiments/AIDO.RNA/configs/translation_efficiency.yaml for full hyperparameter settings.
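If your own dataset does not yet have a fold_id column, one way to create it is sketched below. The helper assign_fold_ids is hypothetical and not part of ModelGenerator; it assumes that fold indices run from 0 to cv_num_folds - 1 (consistent with cv_test_fold_id: 0 above) and that a seeded random assignment with near-equal fold sizes is acceptable for your task.

```python
import random
from typing import List


def assign_fold_ids(num_samples: int, num_folds: int, seed: int = 42) -> List[int]:
    """Assign each sample a fold index in [0, num_folds) with near-equal fold sizes."""
    if num_folds < 2 or num_folds > num_samples:
        raise ValueError("need 2 <= num_folds <= num_samples")
    # Round-robin labels give fold sizes that differ by at most one sample.
    fold_ids = [i % num_folds for i in range(num_samples)]
    # A seeded shuffle makes the assignment random but reproducible.
    random.Random(seed).shuffle(fold_ids)
    return fold_ids


# Example: 25 samples into 10 folds; store the result as the fold_id column.
fold_ids = assign_fold_ids(num_samples=25, num_folds=10)
print(fold_ids)
```

The resulting list can be written as a new column whose name matches cv_fold_id_col.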
Finetuning script
for FOLD in {0..9}
do
    RUN_NAME=te_Muscle_aido_rna_1b600m_fold${FOLD}
    CKPT_SAVE_DIR=logs/rna_tasks/${RUN_NAME}
    CUDA_VISIBLE_DEVICES=0 mgen fit --config experiments/AIDO.RNA/configs/translation_efficiency.yaml \
        --data.config_name translation_efficiency_Muscle \
        --data.cv_test_fold_id $FOLD \
        --trainer.logger.name $RUN_NAME \
        --trainer.callbacks.dirpath $CKPT_SAVE_DIR
done
Evaluation script
for FOLD in {0..9}
do
    CKPT_PATH=logs/rna_tasks/te_Muscle_aido_rna_1b600m_fold${FOLD}/best_val*
    echo ">>> Fold ${FOLD}"
    mgen test --config experiments/AIDO.RNA/configs/translation_efficiency.yaml \
        --data.config_name translation_efficiency_Muscle \
        --data.cv_test_fold_id $FOLD \
        --model.strict_loading True \
        --model.reset_optimizer_states True \
        --trainer.logger null \
        --ckpt_path $CKPT_PATH
done
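Once all folds have been evaluated, the per-fold test metrics can be combined into a mean and a standard deviation, which is the uncertainty estimate that cross-validation is meant to provide. The sketch below assumes you have collected one score per fold yourself (for example by copying it from each run's test output); the helper summarize_folds is hypothetical, and the numbers in the usage example are placeholders, not real results.

```python
import statistics
from typing import Sequence, Tuple


def summarize_folds(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-fold test scores."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two folds")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder values for illustration only; replace with your own fold scores.
example_scores = [1.0, 2.0, 3.0]
mean, std = summarize_folds(example_scores)
print(f"mean={mean:.3f} std={std:.3f} over {len(example_scores)} folds")
```

The sample standard deviation (dividing by k - 1) is used here because the k folds are treated as a sample of possible train/test splits.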