Zeroshot Variant Effect Prediction
Zeroshot variant effect prediction refers to the task of predicting the functional impact of genetic variants, especially single nucleotide polymorphisms (SNPs), without requiring additional task-specific fine-tuning of the model. AIDO.ModelGenerator implements the procedure proposed by Nucleotide Transformer We use this to predict the effects of single nucleotide polymorphisms (SNPs) in the AIDO.DNA-300M model. This task uses the pre-trained models directly, and does not require finetuning.
ClinvarRetrieve
Module automatically download human reference genome, processed clinvar data and demo clinvar data toGENBIO_DATA_DIR/genbio_finetune/dna_datasets
from hugginface https://huggingface.co/datasets/genbio-ai/Clinvar.
Note: check if you have already set GENBIO_DATA_DIR
as an environment variable on terminal
echo $GENBIO_DATA_DIR
If not, set your own data path
echo 'export GENBIO_DATA_DIR=/your/full/path/' >> ~/.bashrc
source ~/.bashrc
Otherwise, ClinvarRetrieve
will automatically set the GENBIO_DATA_DIR
environment variable to <root_path>/ModelGenerator/genbio_data
.
- You can also choose to download raw data and preprocess clinvar data by yourself:
# download hg38 reference genome
wget -P $GENBIO_DATA_DIR/genbio_finetune/dna_datasets https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/hg38.fa.gz
# download raw clinvar dataset
wget -P $GENBIO_DATA_DIR/genbio_finetune/dna_datasets https://hgdownload.soe.ucsc.edu/gbdb/hg38/bbi/clinvar/clinvarMain.bb
# download package to convert .bb file to .bed file
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
chmod +x bigBedToBed
./bigBedToBed $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bb $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bed
# Preprocess the raw ClinVar dataset to retain only the following fields: 'chrom', 'start', 'end', 'name', '_clinSignCode', 'ref', 'mutate', and 'effect'.
# The processed file is saved as Clinvar_Processed.tsv.
cd ModelGenerator
python experiments/AIDO.DNA/zeroshot_variant_effect_prediction/preprocess_clinvar.py $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bed
- Run
mgen test --config config.yaml
. If you want to take the norm distance between reference and variant sequence embeddings as prediction, the config should be
model:
class_path: modelgenerator.tasks.ZeroshotPredictionDistance
init_args:
backbone: <you-choose>
data:
class_path: ClinvarRetrieve
init_args:
method: Distance
window: <window size centered around the SNPs>
test_split_files:
- <my_sequences.tsv>
reference_file: <human_reference_genome.ml.fa> # Example: hg38.ml.fa
trainer:
callbacks:
- class_path: modelgenerator.callbacks.PredictionWriter
dict_kwargs:
output_dir: predictions
filetype: tsv
write_cols: ['score','norm_type','labels','num_layer']
If you take the loglikelihood ratio between reference and variant sequence embeddings at the mutation position as prediction, then the config should be
model:
class_path: modelgenerator.tasks.ZeroshotPredictionDiff
init_args:
backbone: <you-choose>
data:
class_path: ClinvarRetrieve
init_args:
method: Diff
window: <window size centered around the SNPs>
test_split_files:
- <my_sequences.tsv>
reference_file: <human_reference_genome.ml.fa> # Example: hg38.ml.fa
trainer:
callbacks:
- class_path: modelgenerator.callbacks.PredictionWriter
dict_kwargs:
output_dir: predictions
filetype: tsv
write_cols: ['score','label']
The labels and scores are also saved in test_predictions.tsv
under the dir specified by --trainer.callbacks.output_dir
.
Here are two examples of how to load HF model for inference For norm distance mode
mgen test --config experiments/AIDO.DNA/zeroshot_variant_effect_prediction/Clinvar_300M_zeroshot_Distance.yaml
For loglikelihood ratio mode
mgen test --config experiments/AIDO.DNA/zeroshot_variant_effect_prediction/Clinvar_300M_zeroshot_Diff.yaml