Regression in Machine Learning
Contributors
Questions
How to use regression techniques to create predictive models from biological datasets?
Objectives
Learn regression background
Apply regression-based machine learning algorithms
Learn ageing biomarkers and predict age from DNA methylation datasets
Learn how visualizations can be used to analyze predictions
Requirements
Published: Mar 21, 2025
Last Updated: Mar 21, 2025
Regression
.pull-left[
- Supervised learning
- Real valued targets
- Cost/error/loss functions
- Algorithms
- Linear model
- Support vectors
- Tree and Ensemble
- Used for
.pull-right[ ]
Cost function
.pull-left[
- Mathematical functions
- Error = True - Predicted
- Examples
- Mean squared error
- Mean absolute error
- Coefficient of determination (R²)
- … ]
.pull-right[ ]
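The cost functions listed above can be worked through by hand on a small example. The following is a minimal sketch in plain Python; the values in `y_true` and `y_pred` are made up for illustration, and the function names are chosen here rather than taken from the slides.

```python
# Toy values chosen only for illustration (not from a real dataset).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]


def mean_squared_error(true, pred):
    """Average of squared errors; large errors are penalised more strongly."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def mean_absolute_error(true, pred):
    """Average of absolute errors; every error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def r2_score(true, pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, mae, round(r2, 3))  # 0.375 0.5 0.925
```

Mean squared error and mean absolute error are errors (lower is better), whereas R² is a score (higher is better, with 1 meaning a perfect fit).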
Linear models
.pull-left[
- Learn weight/coefficient for each feature
- y (predicted target) = w0 + w1 × Feature1 + w2 × Feature2 + … + wN × FeatureN
- w (weights) = [w0, w1, w2, …, wN]
- X (input features) = [Feature1, Feature2, …, FeatureN]
- Examples
- Linear regression
- Ridge regression
- ElasticNet
- Different variants of the minimisation equation
- Advantage: Simple and fast
- Disadvantage: Problems in learning non-linear relations ]
.pull-right[ ]
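To make the weight equation and the idea of "different variants of the minimisation equation" concrete, here is a minimal NumPy sketch. It assumes synthetic, noise-free data and an arbitrary penalty strength `alpha`; it is an illustration of ordinary least squares versus a ridge-style penalty, not the implementation used by any particular library.

```python
import numpy as np

# Synthetic data, assumed for illustration: y = 1 + 2*Feature1 - 3*Feature2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so that w0 acts as the intercept.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Linear regression: minimise ||X1 w - y||^2.
w_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Ridge regression: minimise ||X1 w - y||^2 + alpha * ||w[1:]||^2.
# Leaving the intercept w0 unpenalised is a choice made for this sketch.
alpha = 10.0
penalty = alpha * np.eye(X1.shape[1])
penalty[0, 0] = 0.0
w_ridge = np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)


def predict(w, features):
    """y = w0 + w1 * Feature1 + ... + wN * FeatureN"""
    return w[0] + features @ w[1:]


print("OLS weights:  ", np.round(w_ols, 3))
print("Ridge weights:", np.round(w_ridge, 3))
print("Prediction for [1, 1]:", predict(w_ols, np.array([[1.0, 1.0]])))
```

The ridge penalty shrinks the feature weights towards zero compared with plain least squares; ElasticNet adds an L1 term to the same kind of objective.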
Support vector machines
.pull-left[
- Linear and non-linear variants
- Support vectors
- Advantages
- High-dimensional data
- Number of samples << number of dimensions
- Memory efficient - uses only support vectors
- Disadvantages
- Large runtime
- Not scale invariant: input features should be scaled before training
- Examples: SVR, NuSVR, LinearSVR ]
.pull-right[ ]
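A support vector regressor can be sketched with scikit-learn's `SVR`. Because the RBF kernel depends on distances between samples, the two features here, which sit on very different scales, are standardised first. The data are synthetic and the values of `C` and `epsilon` are assumptions picked for this example, not recommended settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (an assumption for this sketch): two features on very different scales.
rng = np.random.default_rng(0)
x_small = rng.uniform(0, 1, size=200)
x_large = rng.uniform(0, 1000, size=200)
X = np.column_stack([x_small, x_large])
y = np.sin(2 * np.pi * x_small) + 0.001 * x_large

# Standardise the features, then fit a non-linear (RBF kernel) SVR.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X, y)

# Only the support vectors are kept by the fitted model.
svr = model.named_steps["svr"]
n_support = svr.support_vectors_.shape[0]
train_r2 = model.score(X, y)
print(f"{n_support} of {X.shape[0]} samples are support vectors; training R2 = {train_r2:.3f}")
```

Swapping `SVR` for `LinearSVR` or `NuSVR` in the pipeline gives the other variants named above.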
K nearest neighbours
.pull-left[
- Prediction based on the nearest neighbours
- Examples
- K Nearest neighbours: based on k neighbours
- Radius neighbours: neighbours within r radius
- Advantages
- Simple to understand
- Non-parametric
- Disadvantages
- Runtime increases with data
- High memory requirements
- Sensitive to outliers ]
.pull-right[ ]
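Both neighbour-based variants can be written in a few lines of plain Python. This is a minimal sketch on made-up one-dimensional data; `knn_predict` and `radius_predict` are hypothetical helper names, and real implementations use faster neighbour searches.

```python
import math

# Toy training data (made up): one feature, one real-valued target.
# The last sample is far from the others and plays the role of an outlier.
train_X = [[1.0], [2.0], [3.0], [4.0], [10.0]]
train_y = [1.0, 2.0, 3.0, 4.0, 10.0]


def knn_predict(query, k):
    """Average the targets of the k training samples closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(query, train_X[i]))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def radius_predict(query, r):
    """Average the targets of all training samples within distance r of `query`."""
    inside = [train_y[i] for i in range(len(train_X))
              if math.dist(query, train_X[i]) <= r]
    if not inside:
        raise ValueError("no training samples within the given radius")
    return sum(inside) / len(inside)


print(knn_predict([2.5], k=2))       # 2.5
print(knn_predict([2.5], k=5))       # 4.0
print(radius_predict([2.5], r=1.5))  # 2.5
```

Every prediction scans the full training set, which is why runtime and memory grow with the data. With `k=5` the distant sample at 10.0 pulls the prediction from 2.5 up to 4.0.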
Decision tree
.pull-left[
- Decision rules and paths
- Advantages
- Easy to interpret
- Prediction cost is logarithmic in the number of training samples
- Disadvantages
- Sensitive to variations in data
- Prone to overfit
- Need to balance dataset ]
.pull-right[ ]
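The decision rules learned by a tree can be printed and read directly. The sketch below uses scikit-learn's `DecisionTreeRegressor` and `export_text` on synthetic step-shaped data; the feature name `x` and the `max_depth` value are assumptions for illustration. Limiting `max_depth` is one common way to restrain overfitting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic step-shaped data (assumed for this sketch): target is 1 below x = 5, 3 otherwise.
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.where(X[:, 0] < 5, 1.0, 3.0)

# A shallow tree: max_depth caps how many rules can be stacked along one path.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Human-readable decision rules, one line per split or leaf.
rules = export_text(tree, feature_names=["x"])
print(rules)

pred = tree.predict([[2.0], [8.0]])
print(pred)
```

Each prediction follows a single path from the root to a leaf, so only a handful of comparisons are needed per sample.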
Ensemble models
.pull-left[
- Combination of multiple trees
- Bagging
- Build multiple independent trees
- Average prediction
- Examples
- Random Forest
- Extremely Randomised Trees
- Boosting
- Improve tree models sequentially
- Combine weak models into a robust ensemble
- Examples
- AdaBoost
- GradientBoosting ]
.pull-right[ ]
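Bagging and boosting can be compared side by side with scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor`. This is a sketch on synthetic noisy sine data; the number of trees and the split ratio are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Bagging: independent trees. Boosting: trees added one after another.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
boosting = GradientBoostingRegressor(n_estimators=100, random_state=0)

scores = {}
for name, model in [("random forest", forest), ("gradient boosting", boosting)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R2 on held-out data
    print(f"{name}: test R2 = {scores[name]:.3f}")

# The forest prediction is the average of its individual trees' predictions.
tree_average = np.mean([t.predict(X_test) for t in forest.estimators_], axis=0)
print(np.allclose(tree_average, forest.predict(X_test)))
```

The last check illustrates the "average prediction" step of bagging; a boosted model instead sums the contributions of its sequentially fitted trees.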
References
- Linear models - https://scikit-learn.org/stable/modules/linear_model.html
- Support vector machines - https://www.sciencedirect.com/science/article/pii/S0022169406000473
- Nearest neighbours - https://scikit-learn.org/stable/modules/neighbors.html
- Decision tree - https://gdcoder.com/decision-tree-regressor-explained-in-depth/
- Ensemble models - https://www.geosci-model-dev.net/12/1209/2019/
For additional references, please see the tutorial's References section.
Key Points
- Regression is a supervised approach in machine learning.
- For regression tasks, data is divided into training and test sets.
- A regression model is fitted on the training set and evaluated by predicting the targets of the test set.
- For each regression algorithm, the parameters should be optimised based on the dataset.
- Regression methods can be applied to, for example, DNA methylation datasets to predict age.
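The workflow summarised in these key points (split the data, fit on the training set, optimise parameters, evaluate on the test set) can be sketched with scikit-learn. The data below come from `make_regression` and merely stand in for a real biological dataset; the parameter grid is an assumption chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumed for this sketch).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Divide the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Optimise the parameters with 5-fold cross-validation on the training set only.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
best_params = search.best_params_
test_r2 = search.score(X_test, y_test)
print("best parameters:", best_params)
print(f"test R2: {test_r2:.3f}")
```

Keeping the test set out of the parameter search means the final score reflects samples the model has not seen.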
Thank you!
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!