RNA Secondary Structure Prediction
As with proteins, structure determines RNA function. RNA secondary structure, formed by base pairing, is more stable and accessible than its tertiary form within cells. Accurate prediction of RNA secondary structure is essential for tasks such as higher-order structure prediction and function prediction. As discussed in our paper AIDO.RNA, we finetune the AIDO.RNA-1.6B model on the training splits of the following two datasets:
1. bpRNA
2. Archive-II
We preprocessed the datasets and split them into train, validation, and test splits, following the procedure of a previous study, RiNALMo.
To finetune AIDO.RNA-1.6B on RNA SS:
- Set the environment variable for ModelGenerator's data directory:
  export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
- Download the preprocessed data (provided as a zip file named rna_ss_data.zip) from here. Unzip rna_ss_data.zip inside the directory ${MGEN_DATA_DIR}/modelgenerator/datasets/. Alternatively, you can simply run the following script to do this:
  mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/
  wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/ https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/resolve/main/rna_ss_data.zip
  unzip ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data.zip -d ${MGEN_DATA_DIR}/modelgenerator/datasets/
  You should find two sub-folders containing the preprocessed datasets:
  1. bpRNA: ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/bpRNA
  2. Archive-II: ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/archiveII
- Then run a finetuning job on either dataset as follows (note that we use a finetuning scheduler here; see this tutorial for details):
  - To train on the bpRNA dataset, run the following command:
    bash rna_secondary_structure_prediction.sh train bpRNA
  - Alternatively, to finetune on the Archive-II dataset (for the inter-family generalization experiment discussed in the paper AIDO.RNA), run the following command:
    bash rna_secondary_structure_prediction.sh train archiveII_<FamilyName>
    Here, <FamilyName> is any of the following nine strings (representing different RNA families in the Archive-II dataset): 5s, 16s, 23s, grp1, srp, telomerase, RNaseP, tmRNA, tRNA. Note that, following the convention used by RiNALMo's code repository, the chosen <FamilyName> is used only as the test set, and the remaining families are used for training and validation. An example finetuning run with the 5s family:
    bash rna_secondary_structure_prediction.sh train archiveII_5s
    Here, the AIDO.RNA-1.6B model is finetuned using all splits except archiveII_5s.
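The inter-family experiment repeats the same train command once per held-out family. A minimal sketch of such a loop follows. The function name finetune_all_families and the RUN switch are hypothetical and not part of ModelGenerator, and the sketch assumes it is run from the directory that contains rna_secondary_structure_prediction.sh. By default it only prints the commands it would run.

```shell
# The nine Archive-II family names listed above.
FAMILIES="5s 16s 23s grp1 srp telomerase RNaseP tmRNA tRNA"

# Hypothetical wrapper (not part of ModelGenerator).
# Prints one train command per held-out family; executes them only if RUN=1.
finetune_all_families() {
  for family in ${FAMILIES}; do
    echo "bash rna_secondary_structure_prediction.sh train archiveII_${family}"
    if [ "${RUN:-0}" = "1" ]; then
      # Assumes the script is in the current directory; stop at the first failure.
      bash rna_secondary_structure_prediction.sh train "archiveII_${family}" || return 1
    fi
  done
}
```

Calling finetune_all_families prints the nine commands for review; calling it with RUN=1 set would launch the jobs one after another, each holding out a different family.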
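Before launching any of the finetuning jobs above, it may help to confirm that the unzip step produced the two expected sub-folders. The sketch below checks for them. check_rna_ss_data is a hypothetical helper name, and the paths are the ones listed in the download step.

```shell
# Hypothetical helper (not part of ModelGenerator).
# Returns 0 if both dataset sub-folders exist under MGEN_DATA_DIR, 1 otherwise.
check_rna_ss_data() {
  if [ -z "${MGEN_DATA_DIR:-}" ]; then
    echo "MGEN_DATA_DIR is not set" >&2
    return 1
  fi
  rna_ss_root="${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data"
  rna_ss_missing=0
  for sub in bpRNA archiveII; do
    if [ -d "${rna_ss_root}/${sub}" ]; then
      echo "found: ${rna_ss_root}/${sub}"
    else
      echo "missing: ${rna_ss_root}/${sub}" >&2
      rna_ss_missing=1
    fi
  done
  return "${rna_ss_missing}"
}
```

Running check_rna_ss_data after the download step reports each folder as found or missing and returns a non-zero status if either is absent.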
To test a finetuned checkpoint on RNA SS:
- Finetune AIDO.RNA-1.6B as discussed above, or download the model.ckpt checkpoint from here.
- Test the checkpoint on the corresponding dataset as follows (replace /path/to/checkpoint with the actual path to the finetuned checkpoint):
  - To test on the bpRNA dataset, run the following command:
    bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint
  - Alternatively, to test on the Archive-II dataset, run the following command:
    bash rna_secondary_structure_prediction.sh test archiveII_<FamilyName> /path/to/checkpoint
    See the previous section for details on <FamilyName>.
Outputs:
- The evaluation scores will be printed on the console.
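Because the scores are only printed to the console, it can be convenient to keep a copy of the output. The following is a minimal sketch using the standard tee utility. run_and_log and the log file name in the usage note are hypothetical, not something the test script provides.

```shell
# Hypothetical helper (not part of ModelGenerator).
# Usage: run_and_log <log_file> <command> [args...]
# Runs the command, shows stdout and stderr on the console, and saves a copy to <log_file>.
# Note: in plain sh the return status is that of tee, not of the command itself.
run_and_log() {
  log_file="$1"
  shift
  "$@" 2>&1 | tee "${log_file}"
}
```

For example, run_and_log bpRNA_test.log bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint would print the evaluation output as usual and also write it to bpRNA_test.log.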