Optimizing DNA Sequences for Biological Functions using a DNA LLM
Under Development!
This tutorial is not in its final state. The content may change substantially in the coming months. Because of this status, it is also not listed in the topic pages.
Overview

Questions:
Objectives:
- to do
Requirements:
- pretraining an LLM for DNA
- fine-tuning an LLM
- zero-shot prediction for DNA variants and synthetic DNA sequence generation
- Hands-on: Introduction to Python
- Hands-on: Python - Warm-up for statistics and machine learning
- Slides: Foundational Aspects of Machine Learning using Python
- Hands-on: Foundational Aspects of Machine Learning using Python
- Slides: Neural networks using Python
- Hands-on: Neural networks using Python
- Slides: Deep Learning (without Generative Artificial Intelligence) using Python
- Hands-on: Deep Learning (without Generative Artificial Intelligence) using Python
- Hands-on: Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
- Hands-on: Fine-tuning a LLM for DNA Sequence Classification
- Hands-on: Predicting Mutation Impact with Zero-shot Learning using a pretrained DNA LLM
- Hands-on: Generating Artificial Yeast DNA Sequences using a DNA LLM
Time estimation: 3 hours
Level: Intermediate
Published: Apr 17, 2025
Last modification: Apr 17, 2025
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
PURL: https://gxy.io/GTN:T00522
Revision: 1
Best viewed in a Jupyter Notebook
This tutorial is best viewed in a Jupyter notebook! You can load this notebook in one of the following ways:
Run on the GTN with JupyterLite (in-browser computations)
Launching the notebook in Jupyter in Galaxy
- Instructions to Launch JupyterLab
- Open a Terminal in JupyterLab with File -> New -> Terminal
- Run
wget https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-optimization/statistics-genomic-llm-sequence-optimization.ipynb
- Select the notebook that appears in the list of files on the left.
Downloading the notebook
- Right-click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
- Select "Save Link As..."
Prepare resources
Install dependencies
# Install the library versions this tutorial was tested with.
!pip install datasets==3.0.1
!pip install torch==2.5.0
!pip install transformers -U
!pip install accelerate==1.1.0
!pip install peft==0.13.2
!pip install bitsandbytes==0.44.1
!pip install flash-attn==2.6.3
!pip install Bio==1.7.1
!pip install orfipy
Import Python libraries
We will use the following libraries: torch, flash_attn, numpy, transformers (AutoTokenizer, EarlyStoppingCallback, Trainer, TrainingArguments, AutoModelForCausalLM, AutoConfig, DataCollatorForLanguageModeling), datasets, and accelerate.
import os
import sys
import time
from os import path
import gc
import flash_attn
import torch
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import transformers
from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
from transformers import AutoModelForCausalLM, AutoConfig
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset
import accelerate
Check versions
Numpy version (tested with 1.26.4):
np.__version__
transformers version (tested with 4.47.1):
transformers.__version__
flash_attn version (tested with 2.6.0.post1 and 2.7.0.post2):
flash_attn.__version__
accelerate version (tested with 0.32.1):
accelerate.__version__
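If you prefer a single check, the same queries can be grouped into one cell. This is just a convenience sketch using the modules imported above; the versions on your system may differ slightly from those listed.
# Print the version of each key library in one place.
for name, module in [("numpy", np), ("torch", torch), ("transformers", transformers),
                     ("flash_attn", flash_attn), ("accelerate", accelerate)]:
    print(f"{name}: {module.__version__}")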
Prepare GPU
# CHECK GPU
# We can see how much VRAM is used and how busy the GPU is.
!nvidia-smi
Thu Feb 6 07:59:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 36C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Control how CUDA uses GPU memory
torch.backends.cudnn.benchmark = True
# Limit the size of memory blocks the CUDA caching allocator can split, to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
# Check that a GPU is visible to PyTorch
torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device(type='cuda')
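To monitor GPU memory from Python rather than nvidia-smi, you can also query PyTorch directly. This is a small sketch: torch.cuda.mem_get_info() simply reports free and total device memory in bytes.
# Report the current device and its free/total memory (in GiB).
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU:", torch.cuda.get_device_name(device))
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    print(f"Free memory: {free_bytes / 1024**3:.2f} GiB / {total_bytes / 1024**3:.2f} GiB")
else:
    device = torch.device("cpu")
    print("No GPU detected; falling back to CPU.")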
Get the model
Mistral-DNA-v0.1 was derived from Mixtral-8x7B and simplified for DNA: the number of layers and the hidden size were reduced. The model was pretrained on the human genome assembly hg38 with 200B DNA sequences.
The model can be downloaded from Hugging Face: https://huggingface.co/RaphaelMourad/Mistral-DNA-v0.1
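As a preview of how this checkpoint is used later on, it can be loaded directly from the Hugging Face Hub with the standard transformers auto classes. This is a minimal sketch, assuming the checkpoint is compatible with AutoTokenizer/AutoModelForCausalLM and loading it in half precision to fit comfortably in the T4's memory; the tutorial itself works from the cloned GitHub repository below.
# Illustration only: load the DNA tokenizer and pretrained model from the Hugging Face Hub.
model_name = "RaphaelMourad/Mistral-DNA-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)  # float16 is an assumption
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
print(model.num_parameters())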
!git clone https://github.com/raphaelmourad/Mistral-DNA.git
!ls
Mistral-DNA sample_data
# Set the working directory to the cloned repository
os.chdir("Mistral-DNA/")
print(os.getcwd())
/content/Mistral-DNA