Zeroshot Variant Effect Prediction

Zeroshot variant effect prediction refers to the task of predicting the functional impact of genetic variants, especially single nucleotide polymorphisms (SNPs), without requiring additional task-specific fine-tuning of the model. AIDO.ModelGenerator implements the procedure proposed by Nucleotide Transformer We use this to predict the effects of single nucleotide polymorphisms (SNPs) in the AIDO.DNA-300M model. This task uses the pre-trained models directly, and does not require finetuning.

ClinvarRetrieve Module automatically download human reference genome, processed clinvar data and demo clinvar data to GENBIO_DATA_DIR/genbio_finetune/dna_datasets from hugginface https://huggingface.co/datasets/genbio-ai/Clinvar.

Note: check if you have already set GENBIO_DATA_DIR as an environment variable on terminal

echo $GENBIO_DATA_DIR

If not, set your own data path

echo 'export GENBIO_DATA_DIR=/your/full/path/' >> ~/.bashrc
source ~/.bashrc

Otherwise, ClinvarRetrieve will automatically set the GENBIO_DATA_DIR environment variable to <root_path>/ModelGenerator/genbio_data.

You can also choose to download raw data and preprocess clinvar data by yourself:

# download hg38 reference genome
wget -P $GENBIO_DATA_DIR/genbio_finetune/dna_datasets https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/hg38.fa.gz

# download raw clinvar dataset
wget -P $GENBIO_DATA_DIR/genbio_finetune/dna_datasets https://hgdownload.soe.ucsc.edu/gbdb/hg38/bbi/clinvar/clinvarMain.bb

# download package to convert .bb file to .bed file
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed
chmod +x bigBedToBed
./bigBedToBed $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bb $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bed

# Preprocess the raw ClinVar dataset to retain only the following fields: 'chrom', 'start', 'end', 'name', '_clinSignCode', 'ref', 'mutate', and 'effect'.
# The processed file is saved as Clinvar_Processed.tsv.
cd ModelGenerator
python experiments/AIDO.DNA/zeroshot_variant_effect_prediction/preprocess_clinvar.py $GENBIO_DATA_DIR/genbio_finetune/dna_datasets/clinvarMain.bed

Run mgen test --config config.yaml. If you want to take the norm distance between reference and variant sequence embeddings as prediction, the config should be

model:
  class_path: modelgenerator.tasks.ZeroshotPredictionDistance
  init_args:
    backbone: <you-choose>
data:
  class_path: ClinvarRetrieve
  init_args:
    method: Distance
    window: <window size centered around the SNPs>
    test_split_files:
      - <my_sequences.tsv>
    reference_file: <human_reference_genome.ml.fa> # Example: hg38.ml.fa
trainer:
  callbacks:
  - class_path: modelgenerator.callbacks.PredictionWriter
    dict_kwargs:
      output_dir: predictions
      filetype: tsv
      write_cols: ['score','norm_type','labels','num_layer']

If you take the loglikelihood ratio between reference and variant sequence embeddings at the mutation position as prediction, then the config should be

model:
  class_path: modelgenerator.tasks.ZeroshotPredictionDiff
  init_args:
    backbone: <you-choose>
data:
  class_path: ClinvarRetrieve
  init_args:
    method: Diff
    window: <window size centered around the SNPs>
    test_split_files:
      - <my_sequences.tsv>
    reference_file: <human_reference_genome.ml.fa> # Example: hg38.ml.fa
trainer:
  callbacks:
  - class_path: modelgenerator.callbacks.PredictionWriter
    dict_kwargs:
      output_dir: predictions
      filetype: tsv
      write_cols: ['score','label']

The labels and scores are also saved in test_predictions.tsv under the dir specified by --trainer.callbacks.output_dir.

Here are two examples of how to load HF model for inference For norm distance mode

mgen test --config experiments/AIDO.DNA/zeroshot_variant_effect_prediction/Clinvar_300M_zeroshot_Distance.yaml

For loglikelihood ratio mode

mgen test --config experiments/AIDO.DNA/zeroshot_variant_effect_prediction/Clinvar_300M_zeroshot_Diff.yaml