{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
Predicting the impact of mutations is a critical task in genomics, as it provides insights into how genetic variations influence biological functions and contribute to diseases. Traditional methods for assessing mutation impact often rely on extensive experimental data or computationally intensive simulations. However, with the advent of large language models (LLMs) and zero-shot learning, we can now predict mutation impacts more efficiently and effectively.
\nZero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn’t explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations.
\nThis tutorial focuses on exactly that approach: we use a pre-trained DNA LLM available on Hugging Face, with no additional training, to assess the impact of mutations. Beyond this specific task, the method opens new avenues for bioinformatics research, particularly in genomics and personalized medicine, by letting researchers probe the functional impact of DNA mutations efficiently.
\nWe will use Mistral-DNA-v1-17M-hg38
, a Mixtral-based mixture-of-experts model pre-trained on the entire human genome. It contains approximately 17 million parameters and was trained on the GRCh38 human genome assembly using sequences of 10,000 bases (10 kb).
\n\nAgenda\nIn this tutorial, we will cover:
\n\n
\n- Prepare resources
\n- Compare the impact of a silent mutation and a codon deletion in the CFTR gene
\n- Compare the effect of SNPs in exons and introns
\n
The first step is to install the required dependencies:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "!pip install Bio==1.7.1\n", "!pip install transformers -U" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-4", "source": "Let’s now import them.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "import os\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import scipy as sp\n", "import torch\n", "import tensorflow as tf\n", "import gzip\n", "from Bio import SeqIO\n", "from transformers import (\n", " AutoConfig,\n", " AutoModelForCausalLM,\n", " AutoTokenizer,\n", " EarlyStoppingCallback,\n", " Trainer,\n", " TrainingArguments,\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-6", "source": "\n\nComment: Versions\nThis tutorial has been tested with following versions:
\n\n
\n- transformers = 4.48.3
\nYou can check the version with:
\n\ntransformers.__version__\n
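\nTo print the versions of the main dependencies in one go, a minimal check:
import torch
import transformers

print("transformers", transformers.__version__)
print("torch", torch.__version__)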
We select the appropriate device (a CUDA-enabled GPU if available) for running PyTorch operations:
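\nThe cell below only evaluates the device; a small sketch that additionally stores it in a variable (assuming the name device), so the model and inputs can later be moved to the GPU explicitly if you wish:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# later, model.to(device) and tensor.to(device) keep everything on the same device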
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "torch.device('cuda' if torch.cuda.is_available() else 'cpu')" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-8", "source": "Let’s check the GPU usage and RAM:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "!nvidia-smi" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-10", "source": "Let’s configure PyTorch and the CUDA environment – software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPU – to optimize GPU memory usage and performance:
\nFirst, we enable cuDNN benchmarking, which lets PyTorch auto-tune and cache the fastest kernels for the current input sizes:
\n torch.backends.cudnn.benchmark=True\n
We then set an environment variable that configures how PyTorch manages CUDA memory allocations, capping the maximum split size to limit fragmentation:
\n os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"max_split_size_mb:32\"\n
We now set up the tokenizer to convert raw DNA sequences into a format that the model can process, enabling it to understand and analyze the sequences effectively:
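\nThe tokenizer (and, below, the model) is loaded by its Hugging Face identifier, referenced as model_name in the following cells. A minimal sketch, assuming the model is published under this repository name (adjust it if the actual repository differs):
model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"  # assumed Hugging Face repository id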
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "tokenizer = transformers.AutoTokenizer.from_pretrained(\n", " model_name,\n", " use_fast=True,\n", " trust_remote_code=True,\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-12", "source": "\n\nQuestion\nWhat do the parameters?
\n\n
\n- use_fast=True?
\n- trust_remote_code=True?
\n👁 View solution
\n\n\n
\n- use_fast=True: Enables the use of a fast tokenizer implementation, which is optimized for speed and efficiency. This is particularly useful when working with large datasets or when performance is a priority.
\n- trust_remote_code=True: Allows the tokenizer to execute custom code from the model repository. This may be necessary for certain architectures or preprocessing steps that require additional functionality.
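\nTo see what kind of units the tokenizer works with, we can encode a short DNA string; the exact ids and sub-sequences depend on the model's byte-pair-encoding vocabulary:
example = tokenizer("ACGTACGTACGTACGT", return_tensors="pt")
print(example["input_ids"])  # token ids, shape (1, number_of_tokens)
print(tokenizer.convert_ids_to_tokens(example["input_ids"][0].tolist()))  # the k-mer-like pieces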
We will now load the pre-trained DNA large language model (LLM) and configure it for our specific task of predicting the impact of DNA mutations.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "model=transformers.AutoModelForCausalLM.from_pretrained(\n", " model_name,\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-14", "source": "We would like to ensure that the model correctly handles padding tokens, which are used to standardize the length of sequences within a batch:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "model.config.pad_token_id = tokenizer.pad_token_id" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-16", "source": "Aligning the padding token ID between the model and tokenizer is crucial for maintaining consistency during training and inference.
\nLet’s look at the model:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "model" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-18", "source": "MixtralForCausalLM(\n (model): MixtralModel(\n (embed_tokens): Embedding(4096, 256)\n (layers): ModuleList(\n (0-7): 8 x MixtralDecoderLayer(\n (self_attn): MixtralAttention(\n (q_proj): Linear(in_features=256, out_features=256, bias=False)\n (k_proj): Linear(in_features=256, out_features=256, bias=False)\n (v_proj): Linear(in_features=256, out_features=256, bias=False)\n (o_proj): Linear(in_features=256, out_features=256, bias=False)\n )\n (block_sparse_moe): MixtralSparseMoeBlock(\n (gate): Linear(in_features=256, out_features=8, bias=False)\n (experts): ModuleList(\n (0-7): 8 x MixtralBlockSparseTop2MLP(\n (w1): Linear(in_features=256, out_features=256, bias=False)\n (w2): Linear(in_features=256, out_features=256, bias=False)\n (w3): Linear(in_features=256, out_features=256, bias=False)\n (act_fn): SiLU()\n )\n )\n )\n (input_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n (post_attention_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n )\n )\n (norm): MixtralRMSNorm((256,), eps=1e-05)\n (rotary_emb): MixtralRotaryEmbedding()\n )\n (lm_head): Linear(in_features=256, out_features=4096, bias=False)\n)\n
\n\nQuestion\nLooking at the model architecture printed above:
\n\n
\n- How does the vocabulary size of 4,096 relate to k-mers in DNA sequences?
\n- What is the role of the embedding layer in MixtralForCausalLM?
\n- How many layers does the MixtralForCausalLM model have, and what is their purpose?
\n- What components make up the self-attention mechanism in MixtralAttention?
\n- How does the Mixture of Experts (MoE) mechanism reduce computational load?
\n- Why is layer normalization important in MixtralForCausalLM?
\n- What advantage do rotary embeddings offer in understanding DNA sequences?
\n\n👁 View solution
\n\n\n
\n- The vocabulary size corresponds to the number of unique “words” or k-mers (subsequences of DNA) the model can recognize, similar to using k-mers of size six, enhanced by byte-pair encoding for more nuanced patterns.
\n- The embedding layer converts DNA sequences into numerical vectors, enabling the model to process and analyze them.
\n- The model has 8 layers, each containing a
\nMixtralDecoderLayer
. These layers process the embedded input sequences through a series of transformations, including self-attention and mixture of experts, to capture complex patterns in the data.- The self-attention mechanism includes query, key, value, and output projections, which weigh the importance of different tokens in the sequence.
\n- MoE activates only a subset of parameters during each forward pass, using a routing mechanism to direct sequences to specific experts.
\n- Layer normalization stabilizes and accelerates training by ensuring consistent scaling of inputs to the attention mechanism and subsequent layers.
\n- Rotary embeddings enhance the model’s understanding of positional information, providing a more nuanced representation of sequence order compared to traditional methods.
\n
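\nAs a quick sanity check, we can count the model parameters; the total should be roughly the 17 million advertised for Mistral-DNA-v1-17M-hg38:
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")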
\n\n\nQuestion\nFor a DNA sequence “ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC”
\n\n
\n- How can we get the hidden states?
\n- How can we compute the mean of the hidden states across the sequence length dimension?
\n- What is the shape of the output?
\n- What does the mean of the hidden states across the sequence length dimension represent?
\n👁 View solution
\n\nLet’s start by defining the DNA sequence:
\n\ndna = \"ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC\"\n
\n
\n- To get the hidden states:
\nTokenize the DNA sequence using the tokenizer:
\n\ntokenized_dna = tokenizer(dna, return_tensors = \"pt\")\n
\nExtract the tensor containing the token IDs from the tokenized output:
\n\ninputs = tokenized_dna[\"input_ids\"]\n
\nPass the tokenized input through the model:
\n\nmodel_outputs = model(inputs)\n
\nExtract the hidden states from the model’s output:
\n\nhidden_states = model_outputs[0].detach()\n
\n- To compute the mean of the hidden states across the sequence length dimension:
\n\nembedding_mean = torch.mean(hidden_states[0], dim=0)\n
\n- The shape is 4,096, the number of possible tokens (the model’s vocabulary size).
\n- It represents the average embedding of the DNA sequence. This fixed-size representation can be used for various downstream tasks, such as classification, clustering, or similarity comparisons.
\n
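\nThe mean is only one way to pool the per-token outputs into a single vector; the comparisons that follow use the maximum over the sequence length dimension instead, which yields a vector of the same size:
embedding_max = torch.max(hidden_states[0], dim=0)[0]  # also a vector of length 4,096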
Let’s look at the portion of the cystic fibrosis transmembrane conductance regulator (CFTR) gene where a mutation responsible for cystic fibrosis occurs.
\nIn this portion, we can observe several cases:
\nCases | \nSequences | \n
---|---|
Wild-type without any mutation | \nATTAAAGAAAATATCATCTTTGGTGTTTCCTAT | \n
Mutation ATT to ATA | \nATAAAAGAAAATATCATCTTTGGTGTTTCCTAT | \n
Deletion of TCT codon | \nATTAAAGAAAATATCA—TTGGTGTTTCCTAT | \n
In the second case, the amino acid does not change (a silent mutation). In the last case, an amino acid is removed. Let’s check whether the mutation and the deletion are the same distance from the wild-type when the distance is computed on the DNA embeddings.
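\nWe can check this codon-level interpretation with Biopython (assuming the reading frame starts at the first base of these fragments):
from Bio.Seq import Seq

print(Seq("ATTAAAGAAAATATCATCTTTGGTGTTTCCTAT").translate())  # wild-type
print(Seq("ATAAAAGAAAATATCATCTTTGGTGTTTCCTAT").translate())  # ATT -> ATA, same amino acid
print(Seq("ATTAAAGAAAATATCATTGGTGTTTCCTAT").translate())      # TCT deletion: one amino acid fewer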
\nFirst, we define the sequences:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "dna_wt= \"ATTAAAGAAAATATCATCTTTGGTGTTTCCTAT\"\n", "dna_mut=\"ATAAAAGAAAATATCATCTTTGGTGTTTCCTAT\"\n", "dna_del=\"ATTAAAGAAAATATCATTGGTGTTTCCTAT\"" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-20", "source": "Let’s compute the hidden states for all DNA sequences:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "tokenized_dna = tokenizer(\n", " [dna_wt, dna_mut, dna_del],\n", " return_tensors=\"pt\",\n", " padding=True,\n", ")\n", "inputs_seqs = tokenized_dna[\"input_ids\"]\n", "model_outputs = model(inputs_seqs)\n", "hidden_states = model_outputs.detach()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-22", "source": "We now compute the maximum of the hidden states accross the sequence length dimension:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "embedding_max = torch.max(hidden_states, dim=1)[0]" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-24", "source": "\n\nQuestion\n\n👁 View solution
\n\n
To compare the effects of the silent mutation and the amino acid deletion, we will compute the distance between the wild-type embedding and the mutation and deletion embeddings using the L2 (Euclidean) distance.
\n\n\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "wt_mut_L2 = torch.norm(embedding_max[0] - embedding_max[1])\n", "print(wt_mut_L2)\n", "wt_del_L2 = torch.norm(embedding_max[0] - embedding_max[2])\n", "print(wt_del_L2)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-26", "source": "The L2 distance, also known as the Euclidean distance, is a measure of the straight-line distance between two points in Euclidean space. It is commonly used to quantify the difference between two vectors, such as embeddings in machine learning.
\nFor two vectors \\(a\\) and \\(b\\) in an \\(n\\)-dimensional space, where \\(a=[a\\_{1},a\\_{2},...,a\\_{n}]\\) and \\(b=[b\\_{1},b\\_{2},...,b\\_{n}]\\), the L2 distance is calculated as:
\n\\( L2 = \\sqrt{\\sum_{i} (a_{i}-b_{i})^{2}} \\)
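\nFor example, with \\(a=[1,2]\\) and \\(b=[4,6]\\), \\(L2 = \\sqrt{(1-4)^{2}+(2-6)^{2}} = \\sqrt{25} = 5\\).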
\n
tensor(27.4657)\ntensor(145.7797)\n
\n\nQuestion\n\n
\n- Is it the silent mutation or the amino acid deletion that have the lowest L2 distance to the wild-type?
\n- How to interpret the result?
\n\n👁 View solution
\n\n\n
\n- The silent mutation has a lower L2 distance to the wild-type compared to the mutation with amino acid deletion.
\n- A smaller L2 distance indicates that the sequences are more similar, while a larger distance suggests greater dissimilarity.
\n
Single Nucleotide Polymorphisms (SNPs) are the most common type of genetic variation, involving a change in a single nucleotide within a DNA sequence. These variations can occur in different regions of the genome, including exons (the coding regions of genes) and introns (the non-coding regions within genes). Understanding the impact of SNPs in these regions is crucial for assessing their role in genetic disorders, phenotypic traits, and overall genome function.
\nWe will now leverage the pre-trained DNA language model to compare the effects of SNPs in exons and introns. By utilizing embeddings generated from DNA sequences, we can quantify the impact of these variations and gain insights into their functional consequences.
\nTo begin our analysis, we need to load the sequence data, which includes sequences with Single Nucleotide Polymorphisms (SNPs) and their corresponding reference sequences (wild-type) without SNPs. These sequences are derived from both introns and exons and are stored in compressed FASTA format on GitHub:
\nWild-type (without SNPs) | \nMutated (with SNP) | \n\n |
---|---|---|
Intron | \nSNPintron_ref_201b.fasta.gz | \nSNPintron_alt_201b.fasta.gz | \n
Exon | \nSNPexon_ref_201b.fasta.gz | \nSNPexon_alt_201b.fasta.gz | \n
We need to download the files and read the sequences from them:
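\nThe next cell defines a helper for this; note that it assumes the download URL and the four FASTA file paths are already defined. A self-contained sketch of the same logic, with a placeholder base URL to be replaced by the actual GitHub location of the files:
import gzip

import requests
from Bio import SeqIO

# Placeholder: replace with the real GitHub "raw" URL hosting the four FASTA files
base_url = "https://raw.githubusercontent.com/<user>/<repo>/main/"

def download_read_fasta(name):
    # Download the compressed FASTA file and save it locally under the same name
    response = requests.get(base_url + name)
    response.raise_for_status()
    with open(name, "wb") as fh:
        fh.write(response.content)
    # Parse the gzipped FASTA and return the sequences as plain strings
    with gzip.open(name, "rt") as handle:
        return [str(record.seq) for record in SeqIO.parse(handle, "fasta")]

exon_wt_seqs = download_read_fasta("SNPexon_ref_201b.fasta.gz")
exon_mut_seqs = download_read_fasta("SNPexon_alt_201b.fasta.gz")
intron_wt_seqs = download_read_fasta("SNPintron_ref_201b.fasta.gz")
intron_mut_seqs = download_read_fasta("SNPintron_alt_201b.fasta.gz")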
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "def downloadReadFastaFile(fasta_file):\n", " response = requests.get(url)\n", " # Check if the request was successful\n", " if response.status_code == 200:\n", " # Open a local file in binary write mode and save the content\n", " with open('file.gz', 'wb') as file:\n", " file.write(response.content)\n", " print(\"File downloaded successfully.\")\n", " else:\n", " print(f\"Failed to download file. HTTP Status code: {response.status_code}\")\n", " # Read the file\n", " seql_list=[]\n", " with gzip.open(fasta_file, \"rt\") as handle:\n", " for record in SeqIO.parse(handle, \"fasta\"):\n", " seqj=str(record.seq)\n", " seql_list.append(seqj)\n", " return seql_list\n", "\n", "exon_wt_seqs = readRegularFastaFile(exon_wt_snp_fp)\n", "exon_mut_seqs = readRegularFastaFile(exon_mut_snp_fp)\n", "intron_wt_seqs = readRegularFastaFile(intron_wt_snp_fp)\n", "intron_mut_seqs = readRegularFastaFile(intron_mut_snp_fp)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-30", "source": "\n\nQuestion\n\n
\n- How many sequences are in the data?
\n- How are the sequences?
\n\n👁 View solution
\n\n1. Check with len(exon_wt_seqs), len(exon_mut_seqs), len(intron_wt_seqs) and len(intron_mut_seqs).\n2. Judging from the file names, each sequence is a 201-base window around the variant, given as a plain A/C/G/T string.
\n
We will keep only the first 100 sequences:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "kseq = 100\n", "exon_wt_seqs = exon_wt_seqs[0:kseq]\n", "exon_mut_seqs = exon_mut_seqs[0:kseq]\n", "intron_wt_seqs = intron_wt_seqs[0:kseq]\n", "intron_mut_seqs = intron_mut_seqs[0:kseq]" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-32", "source": "\n\nQuestion\nHow many differences (SNPs) between reference and alternative sequences are there for first exon sequence?
\n\n👁 View solution
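\nA small sketch that compares the first exon pair position by position:
wt, mut = exon_wt_seqs[0], exon_mut_seqs[0]
diffs = [(i, a, b) for i, (a, b) in enumerate(zip(wt, mut)) if a != b]
print(len(diffs), diffs)  # for a single SNP we expect exactly one differing position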
\n\n
To compute the effect of SNPs, we need two helper functions: one to embed a batch of sequences and one to measure the distance between wild-type and mutated embeddings:
\n def computeEmbedding(seqs):\n tokenized_dna = tokenizer(\n seqs,\n return_tensors=\"pt\",\n padding=True,\n )\n inputs_seqs = tokenized_dna[\"input_ids\"]\n hidden_states = model(inputs_seqs)[0].detach()\n return torch.max(hidden_states, dim=1)[0]\n
def computeMutationEffect(wt_seqs, mut_seqs):\n wt_embedding = computeEmbedding(wt_seqs)\n mut_embedding = computeEmbedding(mut_seqs)\n return torch.norm(mut_embedding - wt_embedding, dim=1)\n
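\nWith these helpers (and assuming the per-sequence distance shown above), the exon and intron SNP effects used below can be computed as:
exon_SNP_distL2 = computeMutationEffect(exon_wt_seqs, exon_mut_seqs)
intron_SNP_distL2 = computeMutationEffect(intron_wt_seqs, intron_mut_seqs)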
\n\nQuestion\nWhat are the dimensions of exon_SNP_distL2 and intron_SNP_distL2?
\n👁 View solution
We can now quantify the impact of SNPs and determine if the differences are statistically significant.
\nTo visualize the predicted effects of SNPs, we can use a boxplot to compare the L2 distances for SNPs in exons and introns. The L2 distance serves as a metric for the impact of mutations, with higher distances indicating a more significant effect.
\nBoxplot of predicted SNP effects
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "SNPs_distL2 = {\"exons\": exon_SNP_distL2, \"introns\": intron_SNP_distL2}\n", "\n", "fig, ax = plt.subplots()\n", "ax.boxplot(SNPs_distL2.values())\n", "ax.set_xticklabels(SNPs_distL2.keys())" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "Zero-shot learning is a technique that allows a pre-trained model to make predictions on tasks it wasn't explicitly trained for, leveraging its existing knowledge. This approach is particularly valuable when labeled data is scarce or when rapid predictions are needed. By using a pre-trained DNA LLM, we can compute embeddings for both wild-type and mutated DNA sequences and compare them to quantify the impact of mutations." ], "id": "" } } }, { "id": "cell-36", "source": "\n\nQuestion\n\nWhat can we conclude from the boxplot?
\n\n👁 View solution
\n\nFrom the plot, we observe that the L2 distance is generally higher for SNPs in exons compared to those in introns. This suggests that SNPs in exons have a stronger predicted effect, which aligns with their role in coding regions of genes.
\n
To determine if the observed differences in L2 distances are statistically significant, we can perform hypothesis tests to compare the distributions:
\nT-test:
\nsp.stats.ttest_ind(exon_SNP_distL2, intron_SNP_distL2)\n
\n\n\nQuestion\nWhat can we conclude from the t-test?
\n👁 View solution
\n\nThe t-test yields a p-value of approximately \\(7.56 \\cdot 10^{−7}\\), indicating a statistically significant difference between the L2 distances of SNPs in exons and introns.
\n
Wilcoxon rank-sum test:
\nsp.stats.wilcoxon(distL2_exonSNPs,distL2_intronSNPs)\n
\n\n\nQuestion\nWhat can we conclude from the Wilcoxon test?
\n👁 View solution
\n\nThe Wilcoxon rank-sum test also shows a significant p-value of approximately \\(2.87 \\cdot 10^{−6}\\), confirming that the distributions are significantly different.
\n
The analysis demonstrates that SNPs in exons have a more substantial impact on the sequence embeddings compared to SNPs in introns, as evidenced by higher L2 distances. This finding is statistically significant, highlighting the importance of exonic SNPs in potentially altering gene function and expression. By understanding these differences, researchers can gain insights into the functional consequences of genetic variations in different genomic regions.
\nThroughout this tutorial, we explored a comprehensive workflow for analyzing DNA sequences and assessing the impact of mutations using a pre-trained DNA language model. We began by tokenizing DNA sequences, converting them into numerical representations that the model could process and analyze effectively. Following this, we loaded and configured a pre-trained model to handle these sequences, ensuring seamless integration with the tokenizer. We then delved into comparing the effects of mutations, both with and without amino acid modifications, showcasing the model’s ability to discern subtle differences in sequence impacts. Furthermore, we focused on analyzing the effects of Single Nucleotide Polymorphisms (SNPs) in exons and introns. By loading the relevant sequences, computing the L2 distances between embeddings of wild-type and mutated sequences, and visualizing the results, we quantified and compared the impact of SNPs in these regions. Our analysis revealed that SNPs in exons generally have a more significant impact than those in introns, a finding supported by statistical tests. This tutorial underscores the power of pre-trained models in bioinformatics, offering valuable insights into the functional consequences of genetic variations and paving the way for further research in genomics.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Pre-trained DNA language models are powerful tools for analyzing genetic sequences, enabling efficient and accurate assessment of mutation impacts without extensive computational resources.\n", "- SNPs in exons generally have a more significant impact on gene function compared to those in introns, highlighting the importance of focusing on coding regions for understanding genetic variations.\n", "- Utilizing embeddings to quantify the effects of mutations provides a robust method for comparing sequence variations, offering insights into the functional consequences of SNPs and other genetic modifications.\n", "- Employing statistical tests, such as t-tests and Wilcoxon rank-sum tests, is crucial for validating observed differences in genetic data, ensuring that findings are supported by empirical evidence.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-zeroshot-prediction/tutorial.html#feedback) and check there for further resources!\n" ] } ] }