Artificial Intelligence and Machine Learning in Life Sciences using Python

PURL: https://gxy.io/GTN:P00029
Comment: What is a Learning Pathway?
[A graphic depicting a winding path from a start symbol to a trophy, with tutorials along the way]
We recommend you follow the tutorials in the order presented on this page. They have been selected to fit together and build up your knowledge step by step. If a lesson has both slides and a tutorial, we recommend you start with the slides, then proceed with the tutorial.

Artificial intelligence (AI) has permeated our lives, transforming how we live and work. Over the past few years, progress in AI has accelerated rapidly and disruptively, driven by significant advances in widespread data availability, computing power and machine learning. Remarkable strides have been made in particular in the development of foundation models: AI models trained on extensive volumes of unlabelled data. Moreover, the drop in the cost of high-throughput technologies means that large amounts of omics data are being generated and made accessible to researchers; analysing these complex, high-volume data is not trivial, and classical statistics cannot exploit their full potential. As such, Machine Learning (ML) and Artificial Intelligence (AI) have been recognized as key opportunity areas for ELIXIR, as evidenced by a number of ongoing activities and efforts throughout the community. Beyond the technological advances, however, it is equally important that individual researchers acquire the knowledge and skills needed to take full advantage of Machine Learning. Being aware of the challenges, opportunities and constraints that ML applications entail is critical to ensuring high-quality research in the life sciences.

Module 0: Python warm-up

Python warm-up for statistics and Machine Learning

Time estimation: 6 hours

Learning Objectives
  • Learn the fundamentals of programming in Python (see the short sketch at the end of this module)
  • to do
Lesson (slides, hands-on tutorial and recordings available):
  • Introduction to Python
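
To give a flavour of what this warm-up covers, here is a minimal, self-contained Python sketch (illustrative only, not taken from the tutorial); the gene name and values are invented for the example.

```python
# Minimal Python warm-up sketch: variables, lists, functions, loops and
# conditionals. Illustrative only; the values below are made up.

gene = "BRCA1"                              # a string variable
expression_values = [2.3, 4.1, 0.0, 5.6]    # a list of floats

def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Loop over the list and use a conditional
for value in expression_values:
    if value == 0.0:
        print(f"{gene}: a zero expression value was measured")

print(f"Mean expression of {gene}: {mean(expression_values):.2f}")
```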

Module 1: Foundational Aspects of Machine Learning

Foundational Aspects of Machine Learning

Time estimation: 3 hours

Learning Objectives
  • Get to know the general scikit-learn (sklearn) syntax (see the sketch after this list)
  • Recognise overfitting and underfitting
  • Understand the need for regularization
  • Use cross-validation and a held-out test set
  • Choose evaluation metrics and handle class imbalance
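
As a rough illustration of these objectives (not the tutorial's own code), the sketch below shows a typical scikit-learn workflow: a held-out test set, five-fold cross-validation of a regularised logistic regression, and balanced accuracy as an imbalance-aware metric. The synthetic dataset and all parameter values are arbitrary choices for the example.

```python
# Minimal scikit-learn sketch: train/test split, cross-validation,
# regularization and an imbalance-aware metric (illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary classification dataset
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Keep a held-out test set that is never used during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# C controls the strength of the L2 penalty (smaller C = stronger regularization)
model = LogisticRegression(C=0.1, max_iter=1000)

# Five-fold cross-validation on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="balanced_accuracy")
print("Cross-validated balanced accuracy:", cv_scores.mean())

# Final fit, then evaluation on the untouched test set
model.fit(X_train, y_train)
print("Test balanced accuracy:",
      balanced_accuracy_score(y_test, model.predict(X_test)))
```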

Module 2: Neural networks

Neural networks

Time estimation: 3 hours

Learning Objectives
  • Initialize a model with a single layer (code)
  • Define a loss function
  • Express the model as an equation
  • Understand how model parameters are learned
  • Implement the training steps (code)
  • Make predictions and save/load models
  • Initialize a model with multiple layers (code)
  • Understand the forward step
  • Understand the concepts of backpropagation and epochs
  • Train the model (code; see the sketch after this list)
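
A minimal sketch of these steps is shown below, assuming PyTorch as the framework; the pathway's hands-on material may use a different library, and the toy data, layer sizes, learning rate and number of epochs are arbitrary choices.

```python
# Single-layer model, loss function, training loop, predictions and
# save/load, sketched with PyTorch (an assumption, illustrative only).
import torch
import torch.nn as nn

# Toy data: 100 samples with 4 features and binary labels
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100, 1)).float()

# A single linear layer followed by a sigmoid, i.e. y_hat = sigmoid(Wx + b)
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                                   # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):      # one epoch = one pass over the data
    optimizer.zero_grad()
    y_hat = model(X)         # forward step
    loss = loss_fn(y_hat, y) # how far predictions are from the labels
    loss.backward()          # backpropagation computes gradients
    optimizer.step()         # parameters are updated from the gradients

# Predictions, then saving and reloading the learned parameters
predictions = (model(X) > 0.5).int()
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt"))
```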

Module 3: Deep Learning (without Generative Artificial Intelligence)

Deep Learning (without Generative Artificial Intelligence)

Time estimation: 3 hours

Learning Objectives
  • Represent input data for deep learning models
  • Understand the concept of convolutional filters
  • Understand the concept of pooling layers
  • Initialise a model with convolutional layers (code)
  • Understand the concept of recurrent neural networks (RNNs)
  • Understand the concept of attention
  • Implement an RNN (code)
  • Implement an attention mechanism (code)
  • Implement fine-tuning (code; see the sketch after this list)
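
The sketch below illustrates convolutional filters, pooling and a recurrent layer, again assuming PyTorch; the one-hot DNA representation, filter counts and layer sizes are illustrative choices, not the tutorial's.

```python
# Convolution, pooling and a recurrent layer over sequence data, sketched
# with PyTorch (an assumption, illustrative only).
import torch
import torch.nn as nn

# A batch of one-hot encoded DNA: (batch, channels = 4 bases, sequence length)
x = torch.randn(8, 4, 100)

conv_block = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=5),  # 16 learned filters
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                               # pooling halves the length
)
features = conv_block(x)
print(features.shape)  # torch.Size([8, 16, 48])

# A recurrent layer over the pooled features (time dimension must come second)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
output, hidden = rnn(features.transpose(1, 2))
print(output.shape)    # torch.Size([8, 48, 32])
```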

Module 4: Generative Artificial Intelligence and Large Language Models for Genomics using Python

This tutorial series provides a comprehensive guide to leveraging large language models for genomics, covering pretraining, fine-tuning, mutation impact prediction, sequence generation, and optimization.

Time estimation: 15 hours

Learning Objectives
  • Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.
  • Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.
  • Prepare and tokenize DNA sequence datasets for model training and evaluation.
  • Configure and implement data collation to organize tokenized data into batches for efficient training.
  • Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.
  • Monitor and evaluate the model's performance during training to ensure effective learning.
  • Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.
  • Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.
  • Load a pre-trained model and modify its architecture to include a classification layer.
  • Prepare and preprocess labeled DNA sequences for fine-tuning.
  • Define and configure training parameters to optimize the model's performance on the classification task.
  • Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
  • Explain the concept of zero-shot learning and its application in predicting the impact of DNA mutations using pre-trained large language models (LLMs).
  • Utilize a pre-trained DNA LLM from Hugging Face to compute embeddings for wild-type and mutated DNA sequences.
  • Compare the embeddings of wild-type and mutated sequences to quantify the impact of mutations using L2 distance as a metric.
  • Interpret the results of the L2 distance calculations to determine the significance of mutation effects and discuss potential implications in genomics research.
  • Develop a script to automate the process of predicting mutation impacts using zero-shot learning, enabling researchers to apply this method to their own datasets efficiently.
  • Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
  • Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
  • Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
  • Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
  • Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
  • Pretrain an LLM on DNA sequences
  • Fine-tune an LLM
  • Perform zero-shot prediction for DNA variants and generate synthetic DNA sequences (see the sketches after this list)
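
As a hedged illustration of the tokenizer, embedding and L2-distance steps listed above (not the tutorial's own code), the sketch below uses the Hugging Face transformers library. The model identifier is a placeholder to be replaced with the DNA language model used in the tutorial, and the code assumes an encoder-style model that exposes last_hidden_state.

```python
# Zero-shot mutation impact as an embedding distance (illustrative sketch).
# The model id is a placeholder; some DNA models also need trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "your-org/your-dna-llm"   # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single embedding per sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, n_tokens, dim)
    return hidden.mean(dim=1).squeeze(0)             # (dim,)

wild_type = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
mutant    = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAC"  # last base changed

# L2 distance between embeddings as a proxy for mutation impact:
# larger distances suggest the mutation shifts the learned representation more.
impact = torch.dist(embed(wild_type), embed(mutant), p=2).item()
print(f"L2 distance (mutation impact proxy): {impact:.4f}")
```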
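
A second sketch illustrates comparing synthetic and real sequences with k-mer counts and PCA; the sequences below are toy examples and k = 3 is an arbitrary choice.

```python
# Compare sequences via normalised 3-mer profiles projected with PCA
# (illustrative only; real analyses would use many more, longer sequences).
from itertools import product

import numpy as np
from sklearn.decomposition import PCA

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 64 possible 3-mers

def kmer_profile(seq: str) -> np.ndarray:
    """Normalised k-mer count vector for one sequence."""
    counts = np.array(
        [sum(seq[i:i + K] == kmer for i in range(len(seq) - K + 1))
         for kmer in KMERS],
        dtype=float)
    return counts / counts.sum()

# Toy sequences standing in for real genomic and model-generated DNA
real_seqs = ["ATGCGTACGTTAGCATGCGTACGTTAGC",
             "GGGCATCGATCGATCGGGTACGATCGAT"]
synthetic_seqs = ["ATGCGTACGATAGCATGCGTACGATAGC",
                  "TTTTACGCGCGCGATATATATCGCGCGC"]

profiles = np.array([kmer_profile(s) for s in real_seqs + synthetic_seqs])
labels = ["real"] * len(real_seqs) + ["synthetic"] * len(synthetic_seqs)

# Project the 64-dimensional k-mer profiles onto two principal components
coords = PCA(n_components=2).fit_transform(profiles)
for label, (pc1, pc2) in zip(labels, coords):
    print(f"{label}: PC1={pc1:.3f}, PC2={pc2:.3f}")
```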

Module 5: Regulations/standards for AI using DOME

Regulations/standards for AI using DOME

Time estimation: 3 hours

Learning Objectives
  • Explain the importance of data provenance and dataset splits in ensuring the integrity and reproducibility of AI research.
  • Develop a comprehensive plan for documenting and sharing AI model configurations, datasets, and evaluation results to enhance transparency and reproducibility in their research (see the sketch after this list)
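
As a minimal sketch of what such documentation might look like, the example below records dataset provenance, splits, model configuration and evaluation settings as structured metadata. The field names loosely follow the four DOME categories (Data, Optimization, Model, Evaluation) but are illustrative rather than an official schema, and the values are placeholders.

```python
# Record provenance, splits, model configuration and evaluation settings as
# a structured report (illustrative field names and placeholder values).
import json

report = {
    "data": {
        "source": "https://example.org/dataset",  # placeholder provenance URL
        "n_samples": 500,
        "splits": {"train": 0.7, "validation": 0.1, "test": 0.2},
        "split_strategy": "stratified random split, fixed random seed 42",
    },
    "optimization": {
        "hyperparameter_search": "grid search over C in {0.01, 0.1, 1.0}",
        "regularization": "L2",
    },
    "model": {
        "type": "LogisticRegression",
        "hyperparameters": {"C": 0.1, "max_iter": 1000},
        "software": {"python": "3.11", "scikit-learn": "1.4"},
    },
    "evaluation": {
        "metric": "balanced accuracy",
        "test_score": None,  # to be filled in after evaluating on the test set
    },
}

# Saving the report alongside the trained model keeps the experiment
# transparent and easier to reproduce.
with open("dome_style_report.json", "w") as handle:
    json.dump(report, handle, indent=2)
```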

Editorial Board

This material is reviewed by our Editorial Board:

Bérénice Batut