K-fold cross validation

Datasets implementing the DataInterface with the KFoldMixin support semi-automatic k-fold crossvalidation for uncertainty estimation.

We use translation efficiency prediction as an example task to demonstrate how to do a k-fold cross validation in ModelGenerator. The logic is to split the dataset into k-fold, and call each fold as a test set iteratively.

Data configs

For cross validation task, we input only one dataset named train containing a colomn fold_id indicating the fold index for each sample. You need to set cv_num_folds, cv_test_fold_id, cv_enable_val_fold, cv_fold_id_col according to your experiment setting.

data:
  class_path: modelgenerator.data.TranslationEfficiency
  init_args:
    path: genbio-ai/rna-downstream-tasks
    config_name: translation_efficiency_Muscle
    normalize: true
    train_split_name: train
    random_seed: 42
    batch_size: 8
    shuffle: true
    cv_num_folds: 10
    cv_test_fold_id: 0
    cv_enable_val_fold: true
    cv_fold_id_col: fold_id

See experiments/AIDO.RNA/configs/translation_efficiency.yaml for full hyperparameter settings.

Finetuning script

for FOLD in {0..9}
  do
    RUN_NAME=te_Muscle_aido_rna_1b600m_fold${FOLD}
    CKPT_SAVE_DIR=logs/rna_tasks/${RUN_NAME}
    CUDA_VISIBLE_DEVICES=0 mgen fit --config experiments/AIDO.RNA/configs/translation_efficiency.yaml \
      --data.config_name translation_efficiency_Muscle \
      --data.cv_test_fold_id $FOLD \
      --trainer.logger.name $RUN_NAME \
      --trainer.callbacks.dirpath $CKPT_SAVE_DIR
  done

Evaluation script

for FOLD in {0..9}
do
  CKPT_PATH=logs/rna_tasks/te_Muscle_aido_rna_1b600m_fold${FOLD}/best_val*
  echo ">>> Fold ${FOLD}"
  mgen test --config experiments/AIDO.RNA/configs/translation_efficiency.yaml \
    --data.config_name translation_efficiency_Muscle \
    --data.cv_test_fold_id $FOLD \
    --model.strict_loading True \
    --model.reset_optimizer_states True \
    --trainer.logger null \
    --ckpt_path $CKPT_PATH
done