<div style="border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;">

# Optimizing DNA Sequences for Biological Functions using a DNA LLM

by [Raphael Mourad](https://training.galaxyproject.org/hall-of-fame/raphaelmourad/), [Bérénice Batut](https://training.galaxyproject.org/hall-of-fame/bebatut/)

CC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)

**Objectives**

- Pretrain an LLM on DNA sequences
- Fine-tune the LLM
- Perform zero-shot prediction for DNA variants and generate synthetic DNA sequences

**Time Estimation: 3H**
</div>


<h1 id="prepare-resources">Prepare resources</h1>
<h2 id="install-dependencies">Install dependencies</h2>


In [None]:
!pip install datasets==3.0.1
!pip install torch==2.5.0
!pip install transformers -U
!pip install accelerate==1.1.0
!pip install peft==0.13.2
!pip install bitsandbytes==0.44.1
!pip install flash-attn==2.6.3
!pip install Bio==1.7.1
!pip install orfipy

<h2 id="import-python-libraries">Import Python libraries</h2>
<ul>
<li><code style="color: inherit">torch</code></li>
<li><code style="color: inherit">flash_attn</code></li>
<li><code style="color: inherit">numpy</code></li>
<li><code style="color: inherit">transformers</code>
<ul>
<li><code style="color: inherit">AutoTokenizer</code></li>
<li><code style="color: inherit">EarlyStoppingCallback</code></li>
<li><code style="color: inherit">Trainer</code></li>
<li><code style="color: inherit">TrainingArguments</code></li>
<li><code style="color: inherit">AutoModelForCausalLM</code></li>
<li><code style="color: inherit">AutoConfig</code></li>
<li><code style="color: inherit">DataCollatorForLanguageModeling</code></li>
</ul>
</li>
<li><code style="color: inherit">datasets</code></li>
<li><code class="language-plaintext highlighter-rouge">accelerate</code></li>
</ul>


In [None]:
#
# During the class,
import os

import sys
import time
from os import path
import gc


import flash_attn
import torch
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt

import transformers

from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
from transformers import AutoModelForCausalLM, AutoConfig
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset

import accelerate

<h2 id="check-versions">Check versions</h2>
<p>Numpy version &gt; 1.26.4</p>


In [None]:
np.__version__

<p>transformers version &gt; 4.47.1</p>


In [None]:
transformers.__version__

<p>flash_attn version &gt; 2.6.0.post1 (2.7.0.post2 also appears in the original notes, presumably as a second tested version)</p>


In [None]:
flash_attn.__version__

<p>accelerate version &ge; 0.32.1</p>


In [None]:
# Tested with accelerate==0.32.1

accelerate.__version__
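
<p>The individual checks above can also be gathered into one helper. The following is a minimal sketch, not part of the original notebook: the names <code>MINIMUM_VERSIONS</code>, <code>version_tuple</code> and <code>check_versions</code> are hypothetical, and the version numbers quoted in this section are assumed to be lower bounds. The comparison only looks at the leading numeric components, which is a simplification; <code>packaging.version.Version</code> is the more robust choice when that package is available.</p>

```python
import re
from importlib import metadata

# Versions quoted in this section, assumed here to be lower bounds.
MINIMUM_VERSIONS = {
    "numpy": "1.26.4",
    "transformers": "4.47.1",
    "accelerate": "0.32.1",
}


def version_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "post1" or "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def check_versions(minimums=MINIMUM_VERSIONS):
    """Map each package name to (installed version or None, meets minimum?)."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        meets = version_tuple(installed) >= version_tuple(minimum)
        report[name] = (installed, meets)
    return report
```

<p>Calling <code>check_versions()</code> in a notebook cell returns a dictionary that shows at a glance which packages are missing or older than expected.</p>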

<h2 id="prepare-gpu">Prepare GPU</h2>


In [None]:
# CHECK GPU
# Shows how much VRAM is in use and how busy the GPU is.
!nvidia-smi

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">Thu Feb  6 07:59:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
</code></pre></div></div>
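
<p>The same memory and utilisation figures can be read programmatically instead of from the table. This is a minimal sketch, not part of the original notebook: the helper names <code>parse_gpu_stats</code> and <code>query_gpu_stats</code> are hypothetical. It relies on the <code>--query-gpu</code> and <code>--format=csv,noheader,nounits</code> options of <code>nvidia-smi</code> and assumes every queried field is reported as a plain integer (some devices report <code>[N/A]</code>, which this sketch does not handle).</p>

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output, one GPU per line."""
    stats = []
    for line in csv_text.strip().splitlines():
        used_mib, total_mib, util_pct = (int(field) for field in line.split(","))
        stats.append(
            {"used_mib": used_mib, "total_mib": total_mib, "util_pct": util_pct}
        )
    return stats


def query_gpu_stats():
    """Return per-GPU stats, or an empty list when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

<p>For the idle Tesla T4 shown above, the raw output would be a single line such as <code>0, 15360, 0</code>.</p>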


In [None]:
# LOOK AT GPU USAGE AND RAM
!nvidia-smi

<div class="language-plaintext highlighter-rouge"><div><pre style="color: inherit; background: transparent"><code style="color: inherit">Thu Feb  6 16:49:41 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   54C    P8             10W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
</code></pre></div></div>
<p>Control how CUDA allocates GPU memory</p>


In [None]:
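# Let cuDNN benchmark the available algorithms and pick the fastest one
torch.backends.cudnn.benchmark = True
# Stop the caching allocator from splitting blocks larger than 32 MB, which limits fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"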
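
<p>The allocator setting can be wrapped in a small helper so that the value is built without stray whitespace. This is a minimal sketch, not part of the original notebook: <code>set_cuda_alloc_conf</code> is a hypothetical name. It assumes the variable is set before any CUDA work starts; setting it before importing <code>torch</code> is the safest option.</p>

```python
import os


def set_cuda_alloc_conf(max_split_size_mb=32):
    """Set PYTORCH_CUDA_ALLOC_CONF and return the value that was written."""
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be positive")
    # Build the option string without surrounding whitespace.
    value = f"max_split_size_mb:{int(max_split_size_mb)}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value
```

<p>Smaller values reduce fragmentation at some cost in allocation speed; 32 is the value used in this notebook.</p>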

In [None]:
# Check GPU
import torch
torch.device('cuda' if torch.cuda.is_available() else 'cpu')

<p>device(type='cuda')</p>
<h1 id="get-the-model">Get the model</h1>
<p>Mistral-DNA-v0.1 is derived from Mixtral-8x7B and adapted to the human genome. The Mixtral-8x7B architecture was simplified for DNA by reducing the number of layers and the hidden size. The model was pretrained on the human genome hg38 with 200B DNA sequences.</p>
<p>The model can be downloaded from Hugging Face: https://huggingface.co/RaphaelMourad/Mistral-DNA-v0.1</p>
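
<p>As an illustration, the checkpoint can be loaded with the <code>transformers</code> classes imported earlier. This is a minimal sketch, not necessarily how the rest of the tutorial loads the model: <code>load_mistral_dna</code> is a hypothetical helper, and passing <code>trust_remote_code=True</code> is an assumption that may not be required for this repository.</p>

```python
# Repository named in the text above.
MODEL_ID = "RaphaelMourad/Mistral-DNA-v0.1"


def load_mistral_dna(model_id=MODEL_ID):
    """Download (or load from the local cache) the tokenizer and causal LM."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code=True is an assumption; drop it if the repo ships no custom code.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return tokenizer, model


# Example use (downloads the weights on first call):
# tokenizer, model = load_mistral_dna()
# n_params = sum(p.numel() for p in model.parameters())
# print(f"{n_params / 1e6:.1f}M parameters")
```

<p>Counting the parameters after loading is a quick way to confirm that the reduced architecture is much smaller than the original Mixtral-8x7B.</p>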


In [None]:
!git clone https://github.com/raphaelmourad/Mistral-DNA.git

In [None]:
!ls

<p>Mistral-DNA  sample_data</p>


In [None]:
# SET DIRECTORY
os.chdir("Mistral-DNA/")
print(os.getcwd())

<p>/content/Mistral-DNA</p>


# Key Points

- To be added

# Congratulations on successfully completing this tutorial!

Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-sequence-optimization/tutorial.html#feedback) and check there for further resources!
