name: inverse
layout: true
class: center, middle, inverse
---
# Regression in Machine Learning
Anup Kumar
PURL: gxy.io/GTN:S00137

Tip: press `P` to view the presenter notes | Use arrow keys to move between slides
???

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press `P` again to switch presenter notes off.

Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window: changing slides in one changes the slide in the other, which is useful when presenting.

---

## Requirements

Before diving into this slide deck, we recommend that you have a look at:

- [Introduction to Galaxy Analyses](/training-material/topics/introduction)
- [Statistics and machine learning](/training-material/topics/statistics)
    - Basics of machine learning: [slides](/training-material/topics/statistics/tutorials/machinelearning/slides.html) - [hands-on](/training-material/topics/statistics/tutorials/machinelearning/tutorial.html)

---

### <i class="far fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions

- How to use regression techniques to create predictive models from biological datasets?

---

### <i class="fas fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives

- Learn regression background
- Apply regression-based machine learning algorithms
- Learn ageing biomarkers and predict age from DNA methylation datasets
- Learn how visualizations can be used to analyze predictions

---

# Regression

.pull-left[

- Supervised learning
- Real-valued targets
- Cost/error/loss functions
- Algorithms
    - Linear model
    - Support vectors
    - Tree and Ensemble
- Used for
    - [Predict gene expression pattern](https://www.frontiersin.org/articles/10.3389/fgene.2019.00120/full)
    - [Estimate DNA copy number](https://academic.oup.com/bioinformatics/article/28/18/2357/252846)
    - [Identify drug response](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881366/)

]

.pull-right[

]

---

# Cost function

.pull-left[

- Mathematical functions
- Error = True - Predicted
- Examples
    - Mean squared error
    - Mean absolute error
    - Coefficient of determination (R2)
    - ...

]

.pull-right[

]

---

# Linear models

.pull-left[

- Learn a weight/coefficient for each feature
- y (predicted target) = w0 + w1 × Feature1 + w2 × Feature2 + … + wN × FeatureN
- w (weights) = [w0, w1, w2, …, wN]
- X (input features) = [Feature1, Feature2, …, FeatureN]
- Examples
    - Linear regression
    - Ridge regression
    - ElasticNet
- The examples are different variants of the minimisation equation
- Advantage: Simple and fast
- Disadvantage: Problems in learning non-linear relations

]

.pull-right[

]

---

# Support vector machines

.pull-left[

- Linear and non-linear variants
- Support vectors
- Advantages
    - Handles high-dimensional data
    - Works when number of samples << number of dimensions
    - Memory efficient: uses only support vectors
- Disadvantages
    - Large runtime
    - Not scale invariant (features should be scaled)
- Examples: SVR, NuSVR, LinearSVR

]

.pull-right[

]

---

# K nearest neighbors

.pull-left[

- Prediction based on the nearest neighbours
- Examples
    - K nearest neighbours: based on k neighbours
    - Radius neighbours: neighbours within radius r
- Advantages
    - Simple to understand
    - Non-parametric
- Disadvantages
    - Runtime increases with data
    - High memory requirements
    - Sensitive to outliers

]

.pull-right[

]

---

# Decision tree

.pull-left[

- Decision rules and paths
- Advantages
    - Easy to interpret
    - Logarithmic cost for prediction
- Disadvantages
    - Sensitive to variations in data
    - Prone to overfitting
    - Need to balance the dataset

]

.pull-right[

]

---

# Ensemble models

.pull-left[

- Combination of multiple trees
- Bagging
    - Build multiple independent trees
    - Average their predictions
    - Examples
        - Random Forest
        - Extremely Randomised Trees
- Boosting
    - Improve tree models sequentially
    - Combine weak models into a robust ensemble
    - Examples
        - AdaBoost
        - GradientBoosting

]

.pull-right[

]

---

# References

- Linear models
    - https://scikit-learn.org/stable/modules/linear_model.html
- Support vector machines
    - https://www.sciencedirect.com/science/article/pii/S0022169406000473
- Nearest neighbours
    - https://scikit-learn.org/stable/modules/neighbors.html
- Decision tree
    - https://gdcoder.com/decision-tree-regressor-explained-in-depth/
- Ensemble models
    - https://www.geosci-model-dev.net/12/1209/2019/

---

# For additional references, please see the tutorial's References section

---

- Galaxy Training Materials ([training.galaxyproject.org](https://training.galaxyproject.org))

???

- If you would like to learn more about Galaxy, there are a large number of tutorials available.
- These tutorials cover a wide range of scientific domains.

---

# Getting Help

- **Help Forum** ([help.galaxyproject.org](https://help.galaxyproject.org))
- **Gitter Chat**
    - [Main Chat](https://gitter.im/galaxyproject/Lobby)
    - [Galaxy Training Chat](https://gitter.im/Galaxy-Training-Network/Lobby)
    - Many more channels (scientific domains, developers, admins)

???

- If you get stuck, there are ways to get help.
- You can ask your questions on the help forum.
- Or you can chat with the community on Gitter.

---

# Join an event

- Many Galaxy events across the globe
- Event Horizon: [galaxyproject.org/events](https://galaxyproject.org/events)

???

- There are frequent Galaxy events all around the world.
- You can find upcoming events on the Galaxy Event Horizon.

---

### <i class="fas fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points

- Regression is a supervised approach in machine learning.
- For regression tasks, data is divided into training and test sets.
- A regression model is fitted on the training set and then used to predict the targets of the test set.
- For each regression algorithm, the parameters should be optimised based on the dataset.
- Regression methods can be applied to, for example, RNA-seq gene expression to predict biological age.

---

## Thank You!

This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
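---

# Backup: linear model and cost functions in code

The linear model and cost function slides can be tied together with a small sketch. It evaluates y = w0 + w1 × Feature1 + … + wN × FeatureN for a few samples and then scores the predictions with mean squared error, mean absolute error and R2. The weights, feature values and targets are made-up toy numbers, and the function names are chosen for this illustration only; this is not the tutorial's workflow, which trains models on real data.

```python
from statistics import mean


def predict(weights, features):
    """Linear model: y = w0 + w1*x1 + ... + wN*xN (w0 is the intercept)."""
    w0, *ws = weights
    if len(ws) != len(features):
        raise ValueError("expected one weight per feature plus the intercept w0")
    return w0 + sum(w * x for w, x in zip(ws, features))


def mean_squared_error(y_true, y_pred):
    """Average of the squared errors (true - predicted)."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute errors |true - predicted|."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: two features per sample, weights = [w0, w1, w2].
weights = [1.0, 2.0, 0.5]
X = [[1.0, 2.0], [2.0, 0.0], [0.0, 4.0], [3.0, 2.0]]
y_true = [4.0, 5.5, 3.0, 7.5]

y_pred = [predict(weights, x) for x in X]
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"predictions: {y_pred}")
print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower is better for the two error measures, whereas R2 is a score where values closer to 1 indicate a better fit.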
Author(s): Anup Kumar

Tutorial content is licensed under the Creative Commons Attribution 4.0 International License.