Trajectory analysis
Contributors
Questions
What is trajectory analysis?
What are the main methods of trajectory inference?
How are the decisions about the trajectory analysis made?
What to take into account when choosing the method for your data?
Objectives
Become familiar with the methods of trajectory inference
Learn how the algorithms produce outputs
Be able to choose the method appropriate for your specific data
Gain insight into methods currently available in Galaxy
Requirements
What is trajectory analysis?
.footnote[Deconinck et al. 2021]
–
Trajectory inference (TI) methods have emerged as a novel subfield within computational biology to better study the underlying dynamics of a biological process of interest, such as:
cellular development  differentiation  immune responses 

.image40[ ]  .image40[ ]  .image40[ ] 
– TI allows us to study how cells evolve from one cell state to another, and subsequently when and how cell fate decisions are made.
.image40[ ]
Speaker Notes

TI helps to understand  ‘visualise’  the biological process that the green cell went into
Clustering, trajectory and pseudotime
.footnote[Deconinck et al. 2021] –
.pullleft[ Clustering calculates cell similarities to group specific cell types, that can be identified based on the marker genes expressed in each cluster.
.image75[ ]
]
Speaker Notes
 axes labels  here tSNE  the method of dimentionality reduction that was used
 each coloured group of cells => different clusters

combine clusters and you get partition(s)
.pullright[ Trajectory inference helps to understand how those cell types are related  whether cells differentiate, change in response to stimuli or over time. To infer trajectories, we need data from cells at different points along a path of differentiation. This inferred temporal dimension is known as pseudotime. Pseudotime measures the cells’ progress through the transition.
.image60[ ]
]
Speaker Notes _______________
 axes labels  here UMAP was used as the method of dimentionality reduction
 plot: usually pseudotime goes through range of colours to depict progress through the transition: here dark blue = root cells, yellow = end cells
 plot: also here’s the example of disjoint trajectories  you might choose if you want to learn single tree structure for all the partitions or learn the disjoint trajectory within each partition
Assumptions
Speaker Notes
 snapshot data with cells in different stages of differentiation

multiple methods and how they work in biological context
 It would be quite difficult to analyse a sample every few seconds to see how the cells are changing. Therefore, we assume that snapshot data encompass all naïve, intermediate, and mature cell states with sufficient sampling coverage to allow the reconstruction of differentiation trajectories.
Sagar and Grün 2020
–
 As different TI methods make different assumptions about the data, a first choice to make is based on which biological process is to be expected: not all TI methods are designed to infer all kinds of biological processes.
Deconinck et al. 2021
Why different TI methods are designed to infer different kinds of biological processes?
Speaker Notes There are multiple algorithms used to analyse scRNAseq data to make it more readable. Note that this is a graph from 2016  many new approaches has been developed since then, but this scheme shows how those methods are built. –
Look at the pipeline presented by Cannoodt et al. 2016
–
.image100[ ]
There are multiple methods using particular algorithms, or even their combinations, so you must consider which one would be best for analysing your sample.
Speaker Notes We will consider some aspects shown here to better understand similarities and differences of TI methods. Those are also the things that highly affect the output or are quite important to be aware of.
Similarity
.pullleft[
]
–
.pullright[
 Think of similarity as a distance between cells (nearer cells  more similar)
 Euclidean distance is usually used, it’s just mathematically defined distance between two points
]
Speaker Notes
 distance matrix details: intro, slide 45
Manifold learning / dimensionality reduction
.pullleft[
]
–
.pullright[
 Making a manifold is like making a flat map of a sphere (the Earth)
 We try to ‘flatten’ our data, changing it from highdimensional to lowdimensional
 UMAP (Uniform Manifold Approximation and Projection) is another often used dimensionality reduction method
.image100[ ]
]
Speaker Notes
 Assumes that data lays on a lowdimensional manifold embedded in a higherdimensional space
 Uses principal curves and graphs to obtain smooth trajectories
 figure shows how using different dimensionality reduction methods in Monocle3 affects the outcome
Clustering
.pullleft[
]
–
.pullright[
 Tries to find stable cell states in the data, then connect these states to form a trajectory
 soft Kmeans (fuzzy clustering), Louvain, hierarchical clustering, nonnegative matrix factorization
]
Speaker Notes
 cell state  which genes are expressed, which proteins are produced, what is its function, how it responds to external stimuli. Cell state changes and that’s what we want to inspect
 cell state with a particular function can be called a cell type
 methods explained in intro, slides 54  57
 soft Kmeans: number of clusters are defined before hand, and initialised in random positions
 Louvain: extracts communities from large networks
 nonnegative matrix factorization: splits your data up into a set of individual signals and weights to apply to those signals to recreate your original data
 hierarchical clustering: builds a hierarchy of clusters
Graphbased approach
.pullleft[
]
–
.pullright[
 think of ‘graph structure’ as the nodes (clusters with cell states) and edges (similarity, based on eg. Euclidean distance)
 Many graphbased methods start by building a **knearest neighbors (kNN) ** graph using the Euclidean distance
 Minimum weight spanning tree (MST) is often used to connect the clusters to each other
]
Speaker Notes
 General idea is to construct a graph representation of the cells and perform graph decomposition to reveal connected and disconnected components, or graph diffusion or traversal methods to construct the trajectory topology
 kNN explained in intro, slide 46

MST connects clusters without any cycles and minimises the total total edge weight (cost) to ensure that the most similar clusters are connected
Graphbased approach
.footnote[Haghverdi et al. 2016]
.pullleft[
]
.pullright[
 Diffusion pseudotime (DPT) uses random walks from a userprovided root cell
 random walks between data points — path created between cells in distinct stages of the differentiation process
 short walk => higher probability that the cells are connected
 finding path (pseudotime) based on those probabilities
]
Extensions of trajectory inference: RNA velocity methods
.footnote[Deconinck et al. 2021] –
 RNA velocity methods estimate the future state of a single cell captured in a static snapshot by looking at the ratios between spliced mRNA, unspliced mRNA, and mRNA degradation.
 assuming that the cells with the highest unspliced : spliced reads are in the steady state
 high amount of unspliced reads and low amount of spliced reads => gene is likely being turned on => higher positive velocity
.image40[ ]
Extensions of trajectory inference: RNA velocity methods
.footnote[Deconinck et al. 2021]
 scVelo and CellRank use this extra velocity information to construct a directed knearest neighbor graph, as a starting step for the TI method.
–
 This has the advantage that root cell specification is not necessary, and adds directional information to the trajectory.
–
.image25[ ]
When analysing your data, consider the following:
–
.pullleft[

Tissue from which the cells were analysed

Branching points

Supervised and unsupervised learning

Format of the data

Number of cells and features

Computing power & running time
]
–
.pullright[
To help you evaluate which method would work best for your data, check out this awesome comparision site  dynguidelines, a part of a larger set of open packages for doing and interpreting trajectories called the dynverse.
]
When analysing your data, consider the following: Tissue from which the cells were analysed
–
.pullleft[

Gives an idea of the type of cell relationships you would expect and the prior biological knowledge you can feed the method with, as well as expected cell types and potential contaminants

You need to know about 75% of your data and verify that your analysis shows that, before you can then identify the 25% new information

Trajectory analysis is quite a sensitive method, so always check if the obtained computational results make biological sense!
]
–
.pullright[
_{ Reprinted from “Heterogeneity of Murine Lymph Node Stromal Cell Subsets”, by BioRender.com (2022) } ]
When analysing your data, consider the following: Branching points
–
Cells can differentiate or develop in various way, so they may exhibit different topologies.
–
When analysing your data, consider the following: Branching points
.pullleft[
 diffusion pseudotime (DPT) and branch points
 DPT does not rely on dimension reduction and can thus detect subtle changes in highdimensional gene expression patterns

DPTbased analysis proceeds by identifying branching points that occur after cells progress from a root cell through a common ‘trunk trajectory’
_{Haghverdi et al. 2016}  binning and DNB (Dynamical Network Biomarker) theory
 order cells, according to the pseudotime, evenly segmented into nonoverlap bins with each bin containing X cells
 categorize DNB and nonDNB markers for each bin
 one bin contains some unassigned cells  annotated as branch bin; of the remaining bins, there are bins on the trunk, and bins for each of the branch1 and branch2
_{Chen et al. 2018}
]
–
.pullright[
]
Speaker Notes
 Branching points are determined by comparing two independent DPT orderings over cells, one starting at the root cell x and the other at its maximally distant cell y. The two sequences of pseudotimes are anticorrelated until the two orderings merge in a new branch, where they become correlated. This criterion robustly identifies branching points
 Branching points are identified as points where anticorrelated distances from branch ends become correlated
When analysing your data, consider the following: Supervised and unsupervised learning
–
.pullleft[
]
–
.pullright[

Some algorithms work fully unsupervised, ie. the user is not required to input any priors.

However, many of them take the ‘starting cell’, or ‘end cell’ as information that helps to infer the trajectory which best represents the actual biological processes.

Priors can bias the outcome of the method, but if chosen appropriately, they are really helpful.
]
When analysing your data, consider the following: Supervised and unsupervised learning
If you know which cells are root cells, you should enter this information to the method to make the computations more precise. However, some methods use unsupervised algorithms, so you will get a trajectory based on the tools they use and topology they can infer.
–
Unsupervised  Priors needed: start cells  Priors needed: end cells  Priors needed: both start and end cells 

Slingshot, SCORPIUS, Angle, MST, Waterfall, TSCAN, SLICE, pCreode, SCUBA, RaceID/StemID, Monocle DDRTree  PAGA Tree, PAGA, Wanderlust, Wishbone, topslam, URD, CellRouter, SLICER  MFA, GrandPrix, GPfates, MERLoT  Monocle ICA 
When analysing your data, consider the following: Format of the data
–
 You have to check that your chosen method is compatible with the format of your data. If not, consider converting it or even contacting the developers to implement this into the pipeline.
–
 Other possible inputs:
 Raw (FASTQ);
 Cellxgene matrix;
 Matrix, genes & barcodes tables;
 AnnData
 RDS
–
 Powerful SCEasy tool in Galaxy to convert between formats
–
 Implementations
.image120[ ]
When analysing your data, consider the following: Number of cells and features
–
 The more cells and features you have, the more dimensions there are in your dataset. Therefore, it may be more computationally difficult for the particular method to reduce the dimensionality and infer the correct trajectory.
–
 For example, PAGA generally performs better than RaceID or Monocle when the dataset contains huge number of cells. (PAGA is graphbased, Monocle is tree based).
–
 Also, the more cells you have, the more insightful data you can get, but it is often associated with more noise as well. Therefore, the number of cells also impacts how you preprocess your data and how ‘pure’ it will be for trajectory analysis.
When analysing your data, consider the following: Computing power & running time
It doesn’t directly affect your analysis, however do bear in mind that calculations performed during dimensionality reduction, especially on large datasets, can be really timeconsuming. Therefore, you might consider if you won’t be limited by any of those factors.
Trajectory analysis methods used in Galaxy
–
–
 Scanpy DPT
–
–
 Monocle3 (Trajectory Analysis using Monocle3)
–
 Monocle3 in RStudio (coming soon)
–
 scVelo (coming soon)
Trajectory analysis methods used in Galaxy: PAGA (Partitionbased graph abstraction)
.footnote[Wolf et al. 2018]
–
.pullleft[
 PAGA is used to generalise relationships between groups (clusters)
 Provides an interpretable graphlike map. The graph nodes correspond to cell groups and the edge weights quantify the connectivity between groups
 Available within Scanpy
]
Speaker Notes

The clustering step of RaceID identifies larger clusters of different cells by kmeans clustering which is applied to the similarity matrix using the Euclidean metric

Plot comes from Trajectory Analysis using Python (Jupyter Notebook) in Galaxy tutorial where PAGA was used from the Scanpy toolkit
–
.pullright[
]
Speaker Notes
 But PAGA is also available as a tool in Galaxy
Trajectory analysis methods used in Galaxy: Diffusion Pseudotime in Scanpy
–
.pullleft[
 There is another Scanpy tool in Galaxy, which allows you to calculate diffusion pseudotime
 Based on single cell KNN graphs
 Requires to run Scanpy DiffusionMap and Scanpy FindCluster first
]
–
.pullright[
]
Trajectory analysis methods used in Galaxy: RaceID
.footnote[Grün et al. 2015]
– .pullleft[
 StemID is a tool (part of the RaceID package) that aims to derive a hierarchy of the cell types by constructing a cell lineage tree, rooted at the cluster(s) believed to best describe multipotent progenitor stem cells, and terminating at the clusters which describe more mature cell types.
 FateID tries to quantify the cell fate bias a progenitor type might exhibit to indicate which lineage path it will pursue.
 Where StemID utilises a bottomup approach by starting from mature cell types and working up to the multipotent progenitor, FateID uses topdown approach that starts from the progenitor and works its way down.
]
–
.pullright[
]
Speaker Notes
 Specific Trajectory Lineage Analysis (StemID) and Specific Trajectory Fate Analysis (FateID) in Galaxy
 Nicely described in Downstream Singlecell RNA analysis with RaceID tutorial
Trajectory analysis methods used in Galaxy: RaceID
.footnote[Grün et al. 2015]
.pullleft[
 StemID is a tool (part of the RaceID package) that aims to derive a hierarchy of the cell types by constructing a cell lineage tree, rooted at the cluster(s) believed to best describe multipotent progenitor stem cells, and terminating at the clusters which describe more mature cell types.
 FateID tries to quantify the cell fate bias a progenitor type might exhibit to indicate which lineage path it will pursue.
 Where StemID utilises a bottomup approach by starting from mature cell types and working up to the multipotent progenitor, FateID uses topdown approach that starts from the progenitor and works its way down.
]
.pullright[
]
Speaker Notes From the mentioned tutorial:
 StemID Lineage Tree and Branches of significance.
 (TopLeft) Minimum spanning tree showing most likely connections between clusters
 (TopRight) Minimum spanning tree with projected time series
 (BottomLeft) Significance between clusters
 (BottomRight) Link scores between clustercluster pairs
Trajectory analysis methods used in Galaxy: Monocle3
.footnote[C. Trapnell coletrapnelllab]
– .pullleft[
 Monocle introduced the concept of pseudotime, which is a measure of how far a cell has moved through biological progress
 Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process
 General workflow:
 Preprocess the data
 Reduce dimensionality
 Cluster cells
 Learn the trajectory graph
 Order the cells in pseudotime
 All those steps (and more) can be done using Galaxy tools!
]
–
.pullright[
]
Speaker Notes
 Orders cells by learning an explicit principal graph from the single cell transcriptomics data with advanced machine learning techniques (Reversed Graph Embedding), which robustly and accurately resolves complicated biological processes
 Performs clustering (i.e. using tSNE and density peaks clustering), differential gene expression testing, identifies key marker genes to discover and characterise cell types, infers trajectories
 The use of those tools was shown in (Trajectory Analysis using Monocle3 tutorial)
Trajectory analysis methods used in Galaxy: Monocle3
.footnote[C. Trapnell coletrapnelllab]
.pullleft[
 Monocle introduced the concept of pseudotime, which is a measure of how far a cell has moved through biological progress
 Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process
 General workflow:
 Preprocess the data
 Reduce dimensionality
 Cluster cells
 Learn the trajectory graph
 Order the cells in pseudotime
 All those steps (and more) can be done using Galaxy tools!
]
.pullright[
]
Speaker Notes
 Example of Tcells development in pseudotime, from the mentioned Monocle3 tutorial
Trajectory analysis methods used in Galaxy: scVelo
.footnote[Bergen et al. 2020]
–
 scVelo infers genespecific rates of transcription, splicing and degradation, and recovers the latent time of the underlying cellular processes
 think of velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state
 Compatible with scanpy and hosts efficient implementations of all RNA velocity models
.image25[ ]
Key Points
 Trajectory analysis in pseudotime is a powerful way to get insight into the differentiation and development of cells.
 There are multiple methods and algorithms used in trajectory analysis and depending on the dataset, some might work better than others.
 Trajectory analysis is quite sensitive and thus you should always check if the output makes biological sense.
Thank you!
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.References
 Grün, D., A. Lyubimova, L. Kester, K. Wiebrands, O. Basak et al., 2015 Singlecell messenger RNA sequencing reveals rare intestinal cell types. Nature 525: 251–255. 10.1038/nature14966
 Cannoodt, R., W. Saelens, and Y. Saeys, 2016 Computational methods for trajectory inference from singlecell transcriptomics. European Journal of Immunology 46: 2496–2506. 10.1002/eji.201646347
 Haghverdi, L., M. Büttner, F. A. Wolf, F. Buettner, and F. J. Theis, 2016 Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods 13: 845–848. 10.1038/nmeth.3971
 Chen, C.H., T. Xiao, H. Xu, P. Jiang, C. A. Meyer et al., 2018 Improved design and analysis of CRISPR knockout screens (J. Wren, Ed.). Bioinformatics 34: 4095–4101. 10.1093/bioinformatics/bty450
 Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: largescale singlecell gene expression data analysis. Genome Biology 19: 10.1186/s1305901713820
 Bergen, V., M. Lange, S. Peidli, F. A. Wolf, and F. J. Theis, 2020 Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology 38: 1408–1414. 10.1038/s4158702005913
 Sagar, and D. Grün, 2020 Deciphering Cell Fate Decision by Integrated SingleCell Sequencing Analysis. Annual Review of Biomedical Data Science 3: 1–22. 10.1146/annurevbiodatasci111419091750
 Deconinck, L., R. Cannoodt, W. Saelens, B. Deplancke, and Y. Saeys, 2021 Recent advances in trajectory inference from singlecell omics data. Current Opinion in Systems Biology 27: 100344. 10.1016/j.coisb.2021.05.005
 coletrapnelllab monocle3. https://coletrapnelllab.github.io/monocle3/