Artificial Intelligence and Machine Learning in Life Sciences using Python

PURL: https://gxy.io/GTN:P00029
Comment: What is a Learning Pathway?
[A graphic depicting a winding path from a start symbol to a trophy, with tutorials along the way]
We recommend you follow the tutorials in the order presented on this page. They have been selected to fit together and build up your knowledge step by step. If a lesson has both slides and a tutorial, we recommend you start with the slides, then proceed with the tutorial.

Artificial intelligence (AI) has permeated our lives, transforming how we live and work. Over the past few years, progress in AI has accelerated rapidly and disruptively, driven by significant advances in widespread data availability, computing power and machine learning. Remarkable strides have been made in particular in the development of foundation models: AI models trained on extensive volumes of unlabelled data. Moreover, the drop in the cost of high-throughput technologies means that large amounts of omics data are being generated and made accessible to researchers; analysing these complex, high-volume data is not trivial, and classical statistics cannot exploit their full potential. As such, Machine Learning (ML) and Artificial Intelligence (AI) have been recognized as key opportunity areas for ELIXIR, as evidenced by a number of ongoing activities and efforts throughout the community. Beyond the technological advances, however, it is equally important that individual researchers acquire the knowledge and skills needed to take full advantage of Machine Learning. Being aware of the challenges, opportunities and constraints that ML applications entail is critical to ensuring high-quality research in the life sciences.

Module 0: Python warm-up

Python warm-up for statistics and Machine Learning

Time estimation: 6 hours

Learning Objectives
  • Learn the fundamentals of programming in Python (see the short sketch at the end of this module)
  • to do
Lesson (slides, hands-on tutorial and recordings available):
  • Introduction to Python
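
To give a flavour of what this warm-up covers, here is a minimal, self-contained Python sketch (illustrative only, not taken from the tutorial); the gene name and values are invented for the example.

```python
# Minimal Python warm-up sketch: variables, lists, functions, loops and
# conditionals. Illustrative only; the values below are made up.

gene = "BRCA1"                              # a string variable
expression_values = [2.3, 4.1, 0.0, 5.6]    # a list of floats

def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Loop over the list and use a conditional
for value in expression_values:
    if value == 0.0:
        print(f"{gene}: a zero expression value was measured")

print(f"Mean expression of {gene}: {mean(expression_values):.2f}")
```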

Module 1: Foundational Aspects of Machine Learning

Foundational Aspects of Machine Learning

Time estimation: 3 hours

Learning Objectives
  • Get to know the general scikit-learn (sklearn) syntax (see the sketch after this list)
  • Recognise overfitting and underfitting
  • Understand the need for regularization
  • Use cross-validation and a held-out test set
  • Choose evaluation metrics and handle class imbalance
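
As a rough illustration of these objectives (not the tutorial's own code), the sketch below shows a typical scikit-learn workflow: a held-out test set, five-fold cross-validation of a regularised logistic regression, and balanced accuracy as an imbalance-aware metric. The synthetic dataset and all parameter values are arbitrary choices for the example.

```python
# Minimal scikit-learn sketch: train/test split, cross-validation,
# regularization and an imbalance-aware metric (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary classification dataset
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Keep a held-out test set that is never used during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# C controls the strength of the L2 penalty (smaller C = stronger regularization)
model = LogisticRegression(C=0.1, max_iter=1000)

# Five-fold cross-validation on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="balanced_accuracy")
print("Cross-validated balanced accuracy:", cv_scores.mean())

# Final fit, then evaluation on the untouched test set
model.fit(X_train, y_train)
print("Test balanced accuracy:",
      balanced_accuracy_score(y_test, model.predict(X_test)))
```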

Module 2: Neural networks

Neural networks

Time estimation: 3 hours

Learning Objectives
  • Initialize a model with a single layer (code)
  • Define a loss function
  • Express the model as an equation
  • Understand how model parameters are learned
  • Implement the training steps (code)
  • Make predictions and save/load models
  • Initialize a model with multiple layers (code)
  • Understand the forward step
  • Understand the concepts of backpropagation and epochs
  • Train the model (code; see the sketch after this list)
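
A minimal sketch of these steps is shown below, assuming PyTorch as the framework; the pathway's hands-on material may use a different library, and the toy data, layer sizes, learning rate and number of epochs are arbitrary choices.

```python
# Single-layer model, loss function, training loop, predictions and
# save/load, sketched with PyTorch (an assumption, illustrative only).
import torch
import torch.nn as nn

# Toy data: 100 samples with 4 features and binary labels
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100, 1)).float()

# A single linear layer followed by a sigmoid, i.e. y_hat = sigmoid(Wx + b)
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                                   # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):      # one epoch = one pass over the data
    optimizer.zero_grad()
    y_hat = model(X)         # forward step
    loss = loss_fn(y_hat, y) # how far predictions are from the labels
    loss.backward()          # backpropagation computes gradients
    optimizer.step()         # parameters are updated from the gradients

# Predictions, then saving and reloading the learned parameters
predictions = (model(X) > 0.5).int()
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt"))
```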

Module 3: Deep Learning (without Generative Artificial Intelligence)

Deep Learning (without Generative Artificial Intelligence)

Time estimation: 3 hours

Learning Objectives
  • Represent input data for deep learning models
  • Understand the concept of convolutional filters
  • Understand the concept of pooling layers
  • Initialise a model with convolutional layers (code)
  • Understand the concept of recurrent neural networks (RNNs)
  • Understand the concept of attention
  • Implement an RNN (code)
  • Implement an attention mechanism (code)
  • Implement fine-tuning (code; see the sketch after this list)
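
The sketch below illustrates convolutional filters, pooling and a recurrent layer, again assuming PyTorch; the one-hot DNA representation, filter counts and layer sizes are illustrative choices, not the tutorial's.

```python
# Convolution, pooling and a recurrent layer over sequence data, sketched
# with PyTorch (an assumption, illustrative only).
import torch
import torch.nn as nn

# A batch of one-hot encoded DNA: (batch, channels = 4 bases, sequence length)
x = torch.randn(8, 4, 100)

conv_block = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=5),  # 16 learned filters
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                               # pooling halves the length
)
features = conv_block(x)
print(features.shape)  # torch.Size([8, 16, 48])

# A recurrent layer over the pooled features (time dimension must come second)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
output, hidden = rnn(features.transpose(1, 2))
print(output.shape)    # torch.Size([8, 48, 32])
```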

Module 4: Generative Artificial Intelligence and Large Language Models for Genomics using Python

This tutorial series provides a comprehensive guide to leveraging large language models for genomics, covering pretraining, fine-tuning, mutation impact prediction, sequence generation, and optimization.

Time estimation: 15 hours

Learning Objectives
  • Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.
  • Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.
  • Prepare and tokenize DNA sequence datasets for model training and evaluation.
  • Configure and implement data collation to organize tokenized data into batches for efficient training.
  • Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.
  • Monitor and evaluate the model's performance during training to ensure effective learning.
  • Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.
  • Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.
  • Load a pre-trained model and modify its architecture to include a classification layer.
  • Prepare and preprocess labeled DNA sequences for fine-tuning.
  • Define and configure training parameters to optimize the model's performance on the classification task.
  • Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
  • Explain the concept of zero-shot learning and its application in predicting the impact of DNA mutations using pre-trained large language models (LLMs).
  • Utilize a pre-trained DNA LLM from Hugging Face to compute embeddings for wild-type and mutated DNA sequences.
  • Compare the embeddings of wild-type and mutated sequences to quantify the impact of mutations using L2 distance as a metric.
  • Interpret the results of the L2 distance calculations to determine the significance of mutation effects and discuss potential implications in genomics research.
  • Develop a script to automate the process of predicting mutation impacts using zero-shot learning, enabling researchers to apply this method to their own datasets efficiently.
  • Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
  • Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
  • Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
  • Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
  • Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
  • Pretrain an LLM on DNA sequences
  • Fine-tune an LLM
  • Perform zero-shot prediction for DNA variants and generate synthetic DNA sequences (see the sketches after this list)
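
As a hedged illustration of the tokenizer, embedding and L2-distance steps listed above (not the tutorial's own code), the sketch below uses the Hugging Face transformers library. The model identifier is a placeholder to be replaced with the DNA language model used in the tutorial, and the code assumes an encoder-style model that exposes last_hidden_state.

```python
# Zero-shot mutation impact as an embedding distance (illustrative sketch).
# The model id is a placeholder; some DNA models also need trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "your-org/your-dna-llm"   # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single embedding per sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_tokens, dim)
    return hidden.mean(dim=1).squeeze(0)             # (dim,)

wild_type = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
mutant    = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAC"  # last base changed

# L2 distance between embeddings as a proxy for mutation impact:
# larger distances suggest the mutation shifts the learned representation more.
impact = torch.dist(embed(wild_type), embed(mutant), p=2).item()
print(f"L2 distance (mutation impact proxy): {impact:.4f}")
```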
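
A second sketch illustrates comparing synthetic and real sequences with k-mer counts and PCA; the sequences below are toy examples and k = 3 is an arbitrary choice.

```python
# Compare sequences via normalised 3-mer profiles projected with PCA
# (illustrative only; real analyses would use many more, longer sequences).
from itertools import product

import numpy as np
from sklearn.decomposition import PCA

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 64 possible 3-mers

def kmer_profile(seq: str) -> np.ndarray:
    """Normalised k-mer count vector for one sequence."""
    counts = np.array(
        [sum(seq[i:i + K] == kmer for i in range(len(seq) - K + 1))
         for kmer in KMERS],
        dtype=float)
    return counts / counts.sum()

# Toy sequences standing in for real genomic and model-generated DNA
real_seqs = ["ATGCGTACGTTAGCATGCGTACGTTAGC",
             "GGGCATCGATCGATCGGGTACGATCGAT"]
synthetic_seqs = ["ATGCGTACGATAGCATGCGTACGATAGC",
                  "TTTTACGCGCGCGATATATATCGCGCGC"]

profiles = np.array([kmer_profile(s) for s in real_seqs + synthetic_seqs])
labels = ["real"] * len(real_seqs) + ["synthetic"] * len(synthetic_seqs)

# Project the 64-dimensional k-mer profiles onto two principal components
coords = PCA(n_components=2).fit_transform(profiles)
for label, (pc1, pc2) in zip(labels, coords):
    print(f"{label}: PC1={pc1:.3f}, PC2={pc2:.3f}")
```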

Module 5: Regulations/standards for AI using DOME

Regulations/standards for AI using DOME

Time estimation: 3 hours

Learning Objectives
  • Explain the importance of data provenance and dataset splits in ensuring the integrity and reproducibility of AI research.
  • Develop a comprehensive plan for documenting and sharing AI model configurations, datasets, and evaluation results to enhance transparency and reproducibility in their research (see the sketch after this list)
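
As a minimal sketch of what such documentation might look like, the example below records dataset provenance, splits, model configuration and evaluation settings as structured metadata. The field names loosely follow the four DOME categories (Data, Optimization, Model, Evaluation) but are illustrative rather than an official schema, and the values are placeholders.

```python
# Record provenance, splits, model configuration and evaluation settings as
# a structured report (illustrative field names and placeholder values).
import json

report = {
    "data": {
        "source": "https://example.org/dataset",  # placeholder provenance URL
        "n_samples": 500,
        "splits": {"train": 0.7, "validation": 0.1, "test": 0.2},
        "split_strategy": "stratified random split, fixed random seed 42",
    },
    "optimization": {
        "hyperparameter_search": "grid search over C in {0.01, 0.1, 1.0}",
        "regularization": "L2",
    },
    "model": {
        "type": "LogisticRegression",
        "hyperparameters": {"C": 0.1, "max_iter": 1000},
        "software": {"python": "3.11", "scikit-learn": "1.4"},
    },
    "evaluation": {
        "metric": "balanced accuracy",
        "test_score": None,  # to be filled in after evaluating on the test set
    },
}

# Saving the report alongside the trained model keeps the experiment
# transparent and easier to reproduce.
with open("dome_style_report.json", "w") as handle:
    json.dump(report, handle, indent=2)
```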

Editorial Board

This material is reviewed by our Editorial Board:

Bérénice Batut