View markdown source on GitHub

Automated Cell Annotation

Contributors

Authors: orcid logoMorgan Howells avatar Morgan Howells

Questions

Objectives

Requirements

last_modification Published: Sep 4, 2023
last_modification Last Updated: Oct 9, 2023

What is Cell Annotation?

Basic pipeline for automated cell annotation

Speaker Notes

Automated cell annotation uses gene expression data to classify unknown cells (often referred to as queries) into discrete “cell type” categories.

This process can be done manually or automatically using a wide range of algorithms and statistical methods


Why is it Important?

Speaker Notes

Manually annotating cells can be time consuming and require domain expertise, with the advent of single-cell RNA sequencing (scRNA-seq) large amounts of individual cell data can be processed, on order to annotate this large data automated methods need to be produced.

Additionally these automated tools can help produce more consistent results and allow reproducibility as they are open and usable by anybody.

As mentioned earlier determining cell types can be complex with thousands of cell types and subtypes and large amounts of potential data to process, therefore automated analysis makes interpreting this data and uncovering complex patterns much easier.


Understanding our Data

.pull-right[

Diagram showing the different sections of a AnnData object

]

.pull-left[

What data should we use to classify cells?

Common data types:

]

Speaker Notes

Cells are classified using gene expression data derived from single-cell RNA sequencing, this data tells us which genes are being actively expressed in a cell at a given time.

The expression data is stored in a gene expression matrix where each row represents genes and each column represents individual cells, in the diagram this matrix is represented by the green box, all other boxes represent different types of metadata.

There are many different formats for storing this data and metadata, since different tools use different data types converting between them is a common task.


Challenges

Speaker Notes

There are some challenges that need to be faced in order to perform automated cell annotation:


Manual Cell Annotation

Various cell cluster diagrams showing the expression values of various marker genes

Speaker Notes

In order to manually annotate cell data you would first need to reduce the number of dimensions in your data so that you can produce cluster maps, this is done using a reduction method (PCA, tSNE, UMAP, etc)

Now that your data has been separated into distinct clusters you now need to manually annotate each individual cluster. This is typically done using known marker genes and their associated cell types.

The figure shows a graph for each marker gene we are interested in, it can be seen that each marker gene is only present in certain areas of our data, ideally these areas will be individual clusters.


Feature Selection

.pull-right[

Diagram showing a distribution of gene expressions - outlining genes further away from the mean being better features

]

.pull-left[

]

Speaker Notes

Similar to manual annotation we need to use certain genes as indicators for determining each cell type (since certain gene profiles will correlate more strongly with certain cell types), these are referred to as features/marker genes.

The best gene expressions to use as features will be the ones that vary a lot across different cell type samples as certain gene expressions can be used as a ‘fingerprint’ for that certain cell type.

Features can be selected using various automated methods, some common methods include:

Many tools allow you to choose how many features to use, performance will differ depending on the tool but as a general rule too many features will mean the useful information is too sparse and it will be difficult to find patterns, whereas too few features will lose a lot of useful information. It is best to first use the number of features recommended for the tool and make adjustments using that as a baseline.


Different Classification Methods

Overview of the three different types of methods (marker gene database, correlation-based, supervised classification)

Speaker Notes

There are three primary types of methods that can be applied to classifying cell types


Correlation-Based Methods

.pull-right[

Figure showing a basic correlation-based pipeline

]

Speaker Notes

Correlation-based methods take the query and compares it to every cell/cluster in the reference dataset in an attempt to figure out which cells/clusters the query is most similar to. If the query has a strong similarity to a cell/cluster then it is probable that they share the same cell type classification, this idea is the basis of this method.

Similarity between the reference and the query is computed using various distance metric scores (cosine similarity, Pearson correlation, Spearman correlation, etc) and operates on gene expression data.

By using feature selection we can improve the efficiency of comparing each cell/cluster since there is less gene data to process. Additionally the reference data is stored as an n-dimensional feature space (where each gene is a separate dimension) which allows for the identification of clusters as well as faster searching using algorithms such as approximate nearest neighbour.

This method can be applied to entire clusters or to individual cells in the reference data. Applying the query to entire clusters can help reduce computation time but lacks accuracy, comparing the query to each individual cell yields much better accuracy with the expense of time.


Cluster Annotation with Marker Genes

.pull-right[

Figure showing a basic marker gene database pipeline

]

Speaker Notes

Cluster annotation methods rely on known marker genes for identifying cell types. These genes are hand-selected features that can be used to uniquely identify different cell types. Using a database of marker genes, query data can be compared against the database to determine the most likely matching cell types.

In order to compare the query with known marker gene databases features need to be extracted from the query data, this can be done automatically using the methods described in the feature selection slide.

This method benefits from utilising a known database of cell type markers, these can be published and reused with different methods which allows for easy replication of results as well as allowing researchers to compare algorithms since the reference data remains constant.

This method has some downsides however, it can take a lot of prior manual labor and expertise in order to create an initial database of marker genes. The databases that are available can also limit which cell types can be annotated and may not contain the best marker genes.


Supervised Classification

.pull-left[

Diagram of a support vector machine

]

.pull-right[

Diagram of a random forrest classifier

]

Speaker Notes

Supervised classification trains a statistical model using some labelled reference data in order to learn a mapping between input data (gene expression values) and cell type categories.

There are many algorithms that can be used to train a classifier such as a Support Vector Machine, Artificial Neural Network, Random Forest Classifier, etc. However all these methods operate using some form of statistical analysis.

This method benefits from not requiring any human curated marker genes and is typically able to operate with more noisy data and can handle higher dimensionality data (which typically means being able to use more feature genes in the classification process).

However it should be noted that in order to train a classifier for identifying cell types a good reference dataset of labeled data is required. The size and accuracy of this data will affect the quality of the trained classifier, it can also be time consuming to train a accurate classifier for use in prediction. Additionally in order to use more reference data or add more classifications after training, you will need to either fine-tune or completely re-train the model using the additional data/classifications.


Comparison of Methods

Approach Description Advantages Limitations
Correlation based Compares gene expression patterns between a query dataset and a reference dataset, using measures such as Pearson or Spearman correlation coefficients Comprehensive annotation, flexibility with multiple references and data merging. Can be applied at cell and cluster level. Simple and fast to compute. Performance decreases with the number of features used. Potential bias from reference selection.
Cluster annotation with marker genes Matches the expression patterns of specific marker genes to reference cell types Use of marker gene databases provides a comprehensive collection of known cell type markers. Uncertain annotation if query data is not clean. Relies on human annotated known cell type markers for annotation.
Supervised Classification Machine learning approach that learns the relationship between genes and cell types. Robust to data noise and batch effects, higher accuracy with appropriate training. Can handle high dimensional data. Training step is required which can be computationally intensive. Need to re-train or finetune an existing model to add more reference data.

Speaker Notes

Here is an overview table presenting the information covered in the previous slides.

This can be used as a quick reference for what methods are available, an overview of how they work, and some things to consider when deciding which method is right for you.


Visualisation

Figure showing multiple clusters with descriptions of how they have failed

Speaker Notes

Visualisation can be an important step in classifying cell types, after processing the query data and generating predicted cell type labels it can be good to validate the predictions.

The easiest way to validate the predicted results is to plot the data on a graph and visually inspect the clusters, quickly inspecting the graph can help identify any errors in the classification process and determine what could be causing the issue.

The most common method to produce visualisations is to reduce the dimensionality of the data using methods such as UMAP, tSNE, PCA and plot the output on a 2-dimensional graph, each cluster can then be labelled with the predicted cell types.

Many tools for automated cell annotation offer the ability to plot the predicted data.


Common Issues

Speaker Notes

When performing cell annotation with any tool you may encounter some issues where the annotations are largely wrong or unassigned, if this happens then there are some common things that you should check:


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.