Optimizing DNA Sequences for Biological Functions using a DNA LLM

Under Development!

This tutorial is not in its final state. The content may change a lot in the next months. Because of this status, it is also not listed in the topic pages.

Author(s)	Raphael Mourad Bérénice Batut
Reviewers

Overview
Questions:

to do

Objectives:

pretraining LLM for DNA

finetuning LLM

zeroshot prediction for DNA variants and synthetic DNA sequence generation.

Requirements:

tutorial Hands-on: Introduction to Python

tutorial Hands-on: Python - Warm-up for statistics and machine learning

tutorial Hands-on: Foundational Aspects of Machine Learning using Python

slides Slides: Neural networks using Python

tutorial Hands-on: Neural networks using Python

slides Slides: Deep Learning (without Generative Artificial Intelligence) using Python

tutorial Hands-on: Deep Learning (without Generative Artificial Intelligence) using Python

tutorial Hands-on: Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences

tutorial Hands-on: Fine-tuning a LLM for DNA Sequence Classification

tutorial Hands-on: Predicting Mutation Impact with Zero-shot Learning using a pretrained DNA LLM

tutorial Hands-on: Generating Artificial Yeast DNA Sequences using a DNA LLM

Time estimation: 3 hours

Level: Intermediate Intermediate

Supporting Materials:

Jupyter Notebook

instances Available on these Galaxies

Possibly Working

UseGalaxy.eu

UseGalaxy.org

UseGalaxy.org.au

UseGalaxy.fr

Published: Apr 17, 2025

Last modification: May 22, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00522

version Revision: 2

Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook one of the following ways

Run on the GTN with JupyterLite (in-browser computations)

Click to Launch JupyterLite

Launching the notebook in Jupyter in Galaxy

Instructions to Launch JupyterLab

Open a Terminal in JupyterLab with File -> New -> Terminal

Run wget https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-optimization/statistics-genomic-llm-sequence-optimization.ipynb

Select the notebook that appears in the list of files on the left.

Downloading the notebook

Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)

Save Link As..

Prepare resources

Install dependencies

!pip install datasets==3.0.1
!pip install torch==2.5.0
!pip install transformers -U
!pip install accelerate==1.1.0
!pip install peft==0.13.2
!pip install bitsandbytes==0.44.1
!pip install flash-attn==2.6.3
!pip install Bio==1.7.1
!pip install orfipy

Import Python libraries

torch
flash_attn
numpy
transformers
- AutoTokenizer
- EarlyStoppingCallback
- Trainer
- TrainingArguments
- AutoModelForCausalLM
- AutoConfig
- DataCollatorForLanguageModeling
datasets
accelerate

# 
# During the class, 
import os

import sys
import time
from os import path
import gc


import flash_attn
import torch
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt

import transformers

from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
from transformers import AutoModelForCausalLM, AutoConfig
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset

import accelerate

Check versions

Numpy version > 1.26.4

np.__version__

transformers version > 4.47.1

transformers.__version__

flash_attn > 2.6.0.post1 and 2.7.0.post2

flash_attn.__version__

accelerate > 0.32.1

# Tested with accelerate==0.32.1

accelerate.__version__

Prepare GPU

# CHECK GPU
# We can see how many VRAM is used and how much the GPU is used.
!nvidia-smi

Thu Feb  6 07:59:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# LOOK AT GPU USAGE AND RAM
!nvidia-smi

Thu Feb  6 16:49:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   54C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Control the use of ram by CUDA

torch.backends.cudnn.benchmark=True
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32 "

# Check GPU
import torch
torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device(type='cuda')

Get the model

Mistral-DNA-v0.1 was derived from Mixtral-8x7B for the human genome. Mixtral-8x7B was simplified for DNA: the number of layers and the hidden size were reduced. The model was pretrained using the human genome hg38 with 200B DNA sequences.

The model can be downloaded on HuggingFace: https://huggingface.co/RaphaelMourad/Mistral-DNA-v0.1

!git clone https://github.com/raphaelmourad/Mistral-DNA.git

!ls

Mistral-DNA  sample_data

# SET DIRECTORY
os.chdir("Mistral-DNA/")
print(os.getcwd())

/content/Mistral-DNA

You've Finished the Tutorial

Key points

To be added

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Raphael Mourad, Bérénice Batut, Optimizing DNA Sequences for Biological Functions using a DNA LLM (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-optimization/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{statistics-genomic-llm-sequence-optimization,
author = "Raphael Mourad and Bérénice Batut",
	title = "Optimizing DNA Sequences for Biological Functions using a DNA LLM (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-optimization/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/statistics/tutorials/genomic-llm-sequence-optimization/tutorial.json | jq .admin_install_yaml -r)
Alternatively you can copy and paste the following YAML
---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools: []

No feedback has been recieved yet for this training. Be the first one by filling in the feedback form.