Supervised Learning with Hyperdimensional Computing

Author(s)	Fabio Cumbo

Overview
Questions:

How to encode data into vectors in a high-dimensional space?

What kind of operations can be performed on these vectors?

What is a vector-symbolic architecture?

How to build a classification model out of this architecture?

Objectives:

Learn how to encode data into high-dimensional vectors

Build a vector-symbolic architecture

Use the architecture to build a classification model

Time estimation: 30 minutes

Level: Intermediate

Supporting Materials:

Datasets

FAQs

Available on these Galaxies

Known Working

UseGalaxy.org (Main) ✅ ⭐️

Published: Apr 28, 2023

Last modification: Nov 9, 2023

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00337

Revision: 2

chopin2 (Cumbo et al. 2020) implements a domain-agnostic supervised classification method based on the hyperdimensional (HD) computing paradigm. It is an open-source tool and its code is available on GitHub at https://github.com/cumbof/chopin2.

In this tutorial, we work on a dataset with microbial relative abundances (RA) and presence/absence information (BIN) computed on metagenomic samples collected from individuals affected by colorectal cancer (CRC) in a case/control scenario. The main goal is to build a supervised classification model over the RA profiles with chopin2 in order to discriminate case and control samples with high accuracy.

The dataset with RA profiles was produced with MetaPhlAn3 (Beghini et al. 2021) and is available through the curatedMetagenomicData package for R (Pasolli et al. 2017). The studies describing the analysis of the metagenomic samples were published in Nature Medicine (Thomas et al. 2019; Wirbel et al. 2019). In particular, the dataset contains the profiles of 241 microbial species detected in 101 stool samples from patients affected by CRC and 92 stool samples from healthy individuals.

The data we use in this tutorial is also available on Zenodo.

Please note that both the RA and BIN datasets in Zenodo have also been stratified according to the age category and sex of both case and control individuals. However, in this tutorial we are going to analyze the unstratified datasets, called RA__ThomasAM__species.csv and BIN__ThomasAM__species.csv.

Agenda

A chopin2 analysis is composed of three steps:

Get the data

Build a classification model

Feature selection

Get the data

The first step consists of importing the RA and BIN datasets into a Galaxy history. As previously mentioned, RA refers to the dataset with microbial relative abundance profiles computed with MetaPhlAn3, while BIN refers to the dataset with presence/absence information about microbes in samples.

Hands On: Get the data
Create a new history for this tutorial

To create a new history, click the new-history icon at the top of the history panel.
Import datasets from Zenodo:

RA dataset (RA__ThomasAM__species.csv)

BIN dataset (BIN__ThomasAM__species.csv)
https://zenodo.org/record/7806264/files/RA__ThomasAM__species.csv
https://zenodo.org/record/7806264/files/BIN__ThomasAM__species.csv
Copy the link location

Click Upload at the top of the activity panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

When uploading the datasets, set the type to csv. Both datasets contain samples in the columns and microbial species in the rows. The RA dataset contains relative abundance profiles, with values ranging from 0.0 to 100.0, while the BIN dataset contains presence/absence information (i.e., 0 or 1) derived from the relative abundances: the value of cell ij in the BIN matrix is 1 if the relative abundance of species i in sample j is greater than 0.0 in the RA matrix, and 0 otherwise.
Edit history item attributes

Make sure dataset names are clear, like RA__ThomasAM__species.csv and BIN__ThomasAM__species.csv.

If you did not previously set the datatype to csv, you can do that now under the convert tab.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button
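The relation between the RA and BIN matrices described in this step can be sketched in a few lines of Python. This is only an illustration: the miniature table, its header, and the species and sample names are hypothetical, and the real files on Zenodo may contain additional rows or columns (for instance, class labels) that this sketch does not handle.

```python
import csv
import io

# Hypothetical miniature RA table: species in rows, samples in columns.
RA_CSV = """species,sample_1,sample_2,sample_3
Species_A,0.0,12.5,0.3
Species_B,87.5,0.0,0.0
"""

def binarize(ra_text):
    """Return the table with every abundance > 0.0 replaced by 1, else 0."""
    reader = csv.reader(io.StringIO(ra_text))
    header = next(reader)
    rows = [header]
    for row in reader:
        name, values = row[0], row[1:]
        rows.append([name] + [1 if float(v) > 0.0 else 0 for v in values])
    return rows

bin_rows = binarize(RA_CSV)
for row in bin_rows:
    print(row)
```

Any non-zero abundance, however small, becomes a presence (1), so the BIN matrix discards the magnitude of the abundance and keeps only whether a species was detected.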

Build a classification model

Hyperdimensional computing is an emerging brain-inspired computing paradigm that deals with vectors in a high-dimensional space. Every kind of information is encoded into random vectors (usually binary or bipolar) that are combined to represent complex concepts. The length of the vectors is usually in the order of 10,000, which makes randomly generated vectors quasi-orthogonal to each other.
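A small experiment can make quasi-orthogonality concrete. The sketch below, which uses only the Python standard library and is not taken from chopin2, draws pairs of random bipolar vectors at several dimensionalities and reports the largest absolute cosine similarity observed. The helper names and the choice of 50 pairs are illustrative assumptions.

```python
import random

def random_bipolar(dim, rng):
    """Draw a random vector with components in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def cosine(a, b):
    # Every bipolar vector has norm sqrt(dim), so cosine = dot product / dim.
    return sum(x * y for x, y in zip(a, b)) / len(a)

rng = random.Random(42)
worst = {}
for dim in (10, 100, 10000):
    sims = [abs(cosine(random_bipolar(dim, rng), random_bipolar(dim, rng)))
            for _ in range(50)]
    worst[dim] = max(sims)
    print(f"dim={dim:>5}  largest |cosine| over 50 pairs: {worst[dim]:.3f}")
```

As the dimensionality grows, the similarity between unrelated random vectors concentrates around zero, which is why independently generated vectors can safely stand for distinct concepts.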

There are usually three types of arithmetic operators that can be applied to combine vectors: bundling, binding, and permutation. They all have unique properties that must be taken into account when used (Kanerva 2009).

The bundling operator has the following properties: (i) the resulting vector is similar to the input vectors, (ii) the more vectors are bundled, the harder it is to recover the component vectors, and (iii) if several copies of a vector are included in the bundle, the resulting vector is closer to that dominant vector than to the other components.

The binding operator, instead, (i) is invertible (unbinding), (ii) distributes over bundling, (iii) preserves distance, and (iv) produces a vector that is dissimilar to the input vectors.

Finally, the permutation operator (i) is invertible, (ii) distributes over bundling and any elementwise operation, (iii) preserves distance, and (iv) produces a vector that is dissimilar to the input vector.

The set of vectors and arithmetic operators used for combining vectors is called a vector-symbolic architecture.
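The three operators and some of their properties can be illustrated with bipolar vectors. The sketch below assumes one common realization of a vector-symbolic architecture (elementwise majority vote for bundling, elementwise multiplication for binding, cyclic shift for permutation). Other realizations exist, and this is not necessarily the one used internally by chopin2.

```python
import random

DIM = 10000
rng = random.Random(0)

def random_bipolar():
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bundle(*vectors):
    # Elementwise majority vote; an odd number of inputs avoids ties.
    return [1 if sum(column) > 0 else -1 for column in zip(*vectors)]

def bind(a, b):
    # Elementwise multiplication; self-inverse because (+-1) * (+-1) = 1.
    return [x * y for x, y in zip(a, b)]

def permute(a, shift=1):
    # Cyclic shift; a negative shift undoes a positive one.
    return a[-shift:] + a[:-shift]

def similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / DIM

a, b, c = random_bipolar(), random_bipolar(), random_bipolar()
bundled = bundle(a, b, c)
bound = bind(a, b)

print("bundle vs input :", similarity(bundled, a))       # clearly positive
print("bind vs input   :", similarity(bound, a))         # close to zero
print("unbinding works :", bind(bound, b) == a)
print("permute vs input:", similarity(permute(a), a))    # close to zero
print("permute inverts :", permute(permute(a), -1) == a)
```

With these definitions, binding two vectors with the same third vector leaves their similarity unchanged, which is the distance-preserving property mentioned above.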

chopin2 implements a supervised classification model that works by encoding every observation in a dataset into a high-dimensional vector that combines the values of its features. The model is built by collapsing the vector representations of the observations in the training set into one vector per class (e.g., for the datasets retrieved in the previous step, the classes are CRC and control). The model is then tested by computing the inner product between the vector representation of each observation in the test set and the class vectors. Each observation is assigned to the class whose vector is closest.
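The workflow described above (encode, collapse per class, compare by inner product) can be sketched on toy data. The encoding used here, binding a random identifier vector for each feature with a level vector for its value and summing the results, is a common hyperdimensional scheme assumed for illustration only. The toy presence/absence profiles, the reduced dimensionality, and all function names are hypothetical and do not reproduce the chopin2 implementation.

```python
import random

DIM = 2000  # kept small for speed; the tutorial uses 10000
N_FEATURES = 10
rng = random.Random(1)

def random_bipolar():
    return [rng.choice((-1, 1)) for _ in range(DIM)]

feature_ids = [random_bipolar() for _ in range(N_FEATURES)]
level_vectors = [random_bipolar() for _ in range(2)]  # levels 0 and 1

def encode(observation):
    """Bind each feature ID with the level vector of its value, then sum."""
    encoded = [0] * DIM
    for fid, value in zip(feature_ids, observation):
        level = level_vectors[value]
        for i in range(DIM):
            encoded[i] += fid[i] * level[i]
    return encoded

def train(samples):
    """Collapse the encoded training observations into one vector per class."""
    model = {}
    for observation, label in samples:
        accumulator = model.setdefault(label, [0] * DIM)
        for i, x in enumerate(encode(observation)):
            accumulator[i] += x
    return model

def predict(model, observation):
    """Return the class whose vector has the largest inner product."""
    encoded = encode(observation)
    def inner(label):
        return sum(x * y for x, y in zip(encoded, model[label]))
    return max(model, key=inner)

def make_sample(label):
    # Toy profiles: each class has its own set of "present" features,
    # and every value is flipped with probability 0.1.
    prototype = [1] * 5 + [0] * 5 if label == "CRC" else [0] * 5 + [1] * 5
    noisy = [v ^ 1 if rng.random() < 0.1 else v for v in prototype]
    return noisy, label

training = [make_sample(l) for l in ("CRC", "control") for _ in range(20)]
testing = [make_sample(l) for l in ("CRC", "control") for _ in range(10)]
model = train(training)
accuracy = sum(predict(model, obs) == lab for obs, lab in testing) / len(testing)
print(f"accuracy on toy data: {accuracy:.2f}")
```

The sketch omits the retraining iterations and the cross-validation that the Galaxy tool exposes as parameters. It only shows how a single pass over the training set yields one vector per class.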

Hands On: Build a classification model with `chopin2`

chopin2 (Galaxy version 1.0.7+galaxy1) with the following configuration:

“Select a dataset”: RA__ThomasAM__species.csv

“Vector dimensionality”: 10000

“Levels”: 100

“Model retraining iterations”: 10

“Number of folds for cross-validation”: 5

“Enable feature selection”: Disabled

The number of levels is the number of random vectors that chopin2 generates for encoding data. It must be selected carefully, since the accuracy of the resulting model strongly depends on this parameter (Cumbo et al. 2020).

The tool will create a summary file in your history containing some basic information about the generated classification model and, more importantly, its accuracy. You can open this file by clicking the eye icon.

We suggest that you repeat the same steps for building a classification model on the binary presence/absence profiles in BIN__ThomasAM__species.csv. In this case there is no need to use many levels, since the dataset contains only two possible values (i.e., 0 and 1). Thus, the number of levels must be changed to 2.

Question

Compare the summary files generated by chopin2 when building a classification model over the RA__ThomasAM__species.csv and BIN__ThomasAM__species.csv datasets.

Is there a difference in the accuracy of the two models?

Does the difference in the number of levels have an impact on the running time and the final accuracy of the models?

The accuracy of the model built on the BIN__ThomasAM__species.csv dataset is over 80%, while the accuracy of the model built on the RA__ThomasAM__species.csv dataset is around 75%.

The tool generates as many hyperdimensional vectors as the number of levels. With a high number of levels, the tool may take a while to generate all the vectors at the start, but this does not affect the speed of building the classification model. It can, however, heavily affect memory consumption. The number of levels also has an impact on the accuracy of the model. It is recommended to choose as many levels as the number of unique values in the dataset, also taking into account the precision of the numerical values.
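To see why the number of levels interacts with the precision of the values, consider how a numerical value might be assigned to a level. The sketch below assumes simple equal-width binning between the minimum and maximum of the data. This is an illustrative assumption, and the function name is hypothetical, so the actual mapping in chopin2 may differ.

```python
def level_index(value, minimum, maximum, levels):
    """Map a value in [minimum, maximum] to an integer level in [0, levels - 1]."""
    if maximum == minimum:
        return 0
    fraction = (value - minimum) / (maximum - minimum)
    # The maximum value itself falls into the last level.
    return min(int(fraction * levels), levels - 1)

# Relative abundances between 0.0 and 100.0 with 100 levels.
for value in (0.0, 37.2, 37.9, 100.0):
    print(value, "->", level_index(value, 0.0, 100.0, 100))

# Presence/absence data needs only two levels.
for value in (0, 1):
    print(value, "->", level_index(value, 0, 1, 2))
```

Under this binning, values closer than one bin width, such as 37.2 and 37.9 with 100 levels over the range 0 to 100, share the same level and become indistinguishable to the model, while binary data is represented exactly with two levels.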

Feature selection

chopin2 also implements a feature selection method based on the backward variable elimination strategy: the tool builds a classification model starting with the whole set of features in the dataset and iteratively removes the features that do not contribute to the accuracy of the model.

Be aware that this type of feature selection can lead to the generation of thousands of classification models in order to determine the best features and discard those that do not significantly contribute to accuracy. However, despite the large amount of computational resources required, the algorithm can handle datasets with massive numbers of features.
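A generic backward variable elimination loop helps explain why so many models are built. The sketch below is not the chopin2 implementation: it removes one feature per step, and the `evaluate` callback is a hypothetical stand-in for training and cross-validating a model on a feature subset. Here it is replaced by a toy scoring function with made-up feature names.

```python
def backward_elimination(features, evaluate):
    """Drop one feature at a time as long as the score does not decrease."""
    selected = list(features)
    best_score = evaluate(selected)
    models_built = 1
    while len(selected) > 1:
        candidates = []
        for feature in selected:
            reduced = [f for f in selected if f != feature]
            candidates.append((evaluate(reduced), feature))
            models_built += 1
        score, dropped = max(candidates)
        if score < best_score:
            break  # every possible removal hurts the score
        selected.remove(dropped)
        best_score = score
    return selected, best_score, models_built

# Toy score (percent): informative features help, noisy ones slightly hurt.
INFORMATIVE = {"A", "B"}

def toy_score(subset):
    informative = sum(1 for f in subset if f in INFORMATIVE)
    noise = len(subset) - informative
    return 50 + 20 * informative - noise

selected, best, built = backward_elimination(["A", "B", "n1", "n2", "n3"], toy_score)
print(selected, best, built)
```

Each step evaluates one model per remaining feature, so the number of models grows roughly quadratically with the number of features when one feature is removed at a time.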

Hands On: Identify the best features

chopin2 (Galaxy version 1.0.7+galaxy1) with the following configuration:

“Select a dataset”: BIN__ThomasAM__species.csv

“Vector dimensionality”: 10000

“Levels”: 2

“Model retraining iterations”: 10

“Number of folds for cross-validation”: 5

“Enable feature selection”: Enabled

This time we focus on the BIN__ThomasAM__species.csv dataset only, since chopin2 produced a more accurate classification model on it than on the RA__ThomasAM__species.csv dataset.

The tool will create a summary file in your history (click the eye icon to check its content). It contains a line for each step of the backward variable elimination method, and every line corresponds to a classification model. Note that the last column reports the features excluded in each step.

A selection file will also appear in your history with the list of the best features selected by chopin2 (you can inspect this file by clicking the eye icon as well). Some of them are known to be linked in some way to the genesis and development of CRC, as a simple search on the Disbiome database (Janssens et al. 2018) can confirm. However, these features were selected because together they yield the best accuracy according to the classification model and the feature selection technique implemented in chopin2, and they do not necessarily have biological relevance. This alone is therefore not enough to identify true biomarkers for CRC, and further statistical analyses must be performed to validate and support these findings.

Question

Look at the list of microbial species (features) selected by chopin2 and reported in the selection file. Can you recognize any microbial species known to be linked in some way to the genesis and development of colorectal cancer?

Clostridium symbiosum, Gemella morbillorum, and Parvimonas micra, among the other selected species, are well-known CRC-enriched bacteria. These species have all been proposed as biomarkers for early detection of the disease (Xie et al. 2017; Yao et al. 2021; Zhao et al. 2022).

chopin2 is the first tool that implements a feature selection method based on the hyperdimensional computing paradigm.

You've Finished the Tutorial

Key points

Every kind of data can be encoded into high-dimensional vectors

A vector-symbolic architecture is composed of vectors and a limited set of arithmetic operators

Classification models built according to the hyperdimensional computing paradigm can scale to datasets with massive numbers of features and data points

References

Kanerva, P., 2009 Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive computation 1: 139–159. 10.1007/s12559-009-9009-8
Pasolli, E., L. Schiffer, P. Manghi, A. Renson, V. Obenchain et al., 2017 Accessible, curated metagenomic data through ExperimentHub. Nature methods 14: 1023–1024. 10.1038/nmeth.4468
Xie, Y.-H., Q.-Y. Gao, G.-X. Cai, X.-M. Sun, T.-H. Zou et al., 2017 Fecal Clostridium symbiosum for noninvasive detection of early and advanced colorectal cancer: test and validation studies. EBioMedicine 25: 32–40. 10.1016/j.ebiom.2017.10.005
Janssens, Y., J. Nielandt, A. Bronselaer, N. Debunne, F. Verbeke et al., 2018 Disbiome database: linking the microbiome to disease. BMC microbiology 18: 1–6. 10.1186/s12866-018-1197-5
Thomas, A. M., P. Manghi, F. Asnicar, E. Pasolli, F. Armanini et al., 2019 Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nature medicine 25: 667–678. 10.1038/s41591-019-0405-7
Wirbel, J., P. T. Pyl, E. Kartal, K. Zych, A. Kashani et al., 2019 Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nature medicine 25: 679–689. 10.1038/s41591-019-0406-6
Cumbo, F., E. Cappelli, and E. Weitschek, 2020 A brain-inspired hyperdimensional computing approach for classifying massive DNA methylation data of cancer. Algorithms 13: 233. 10.3390/a13090233
Beghini, F., L. J. McIver, A. Blanco-Míguez, L. Dubois, F. Asnicar et al., 2021 Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10: e65088. 10.7554/eLife.65088
Yao, Y., H. Ni, X. Wang, Q. Xu, J. Zhang et al., 2021 A new biomarker of fecal bacteria for non-invasive diagnosis of colorectal cancer. Frontiers in Cellular and Infection Microbiology 11: 1262. 10.3389/fcimb.2021.744049
Zhao, L., X. Zhang, Y. Zhou, K. Fu, H. C.-H. Lau et al., 2022 Parvimonas micra promotes colorectal tumorigenesis and is associated with prognosis of colorectal cancer patients. Oncogene 41: 4200–4210. 10.1038/s41388-022-02395-7

Citing this Tutorial

Fabio Cumbo, Supervised Learning with Hyperdimensional Computing (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/hyperdimensional_computing/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{statistics-hyperdimensional_computing,
author = "Fabio Cumbo",
	title = "Supervised Learning with Hyperdimensional Computing (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/hyperdimensional_computing/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/statistics/tutorials/hyperdimensional_computing/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: chopin2
  owner: iuc
  revisions: d49893faf877
  tool_panel_section_label: Machine Learning
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
