Frequently Asked Questions
Tip: I’m using the same training data, tools, and parameters as the tutorial, but I get a different number of transcripts with a significant change in gene expression between the G1E and megakaryocyte cellular states. Why?
This is okay! Many aspects of the analysis can affect the exact results you obtain, for example the version of the reference genome and the versions of the tools used. It’s less important to reproduce the exact results shown in the tutorial than to understand the concepts so you can apply them to your own data.
Tip: Are UMIs not actually unique?
Not strictly, but unique enough. Ideally the distribution of UMIs is uniform, so that the chance of two identical UMIs capturing the same transcript (via different amplicons) is small. As barcodes have increased in length, the number of possible UMIs has also increased, allowing the number of UMIs to more or less match the number of transcripts.
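To get a feel for what ‘unique enough’ means, here is a minimal sketch that assumes UMIs are drawn uniformly at random from all 4^n sequences of length n. This is an idealisation, and the function name and the example numbers are illustrative assumptions rather than values from the tutorial.

```python
def umi_collision_probability(umi_length: int, n_molecules: int) -> float:
    """Probability that at least two of n_molecules copies of the same
    transcript receive the same UMI, assuming uniformly random UMIs."""
    n_umis = 4 ** umi_length  # A, C, G or T at each position
    if n_molecules > n_umis:
        return 1.0  # more molecules than UMIs: a clash is unavoidable
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (n_umis - i) / n_umis
    return 1.0 - p_all_distinct


# Hypothetical example: 20 copies of one transcript in a single cell.
for length in (4, 6, 10):
    p = umi_collision_probability(length, n_molecules=20)
    print(f"UMI length {length}: {4 ** length} possible UMIs, "
          f"collision probability {p:.4f}")
```

Under this assumption, longer UMIs make a clash between two copies of the same transcript progressively less likely.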
Tip: Why do we do dimension reduction and then clustering? Why not just cluster on the actual data?
The actual data has tens of thousands of genes, and so tens of thousands of variables to consider. Even after selecting the most variable and highest-quality genes, we can still be left with more than 1000 genes. Clustering a dataset with thousands of variables is possible, but computationally expensive. It is therefore better to first perform dimension reduction, which condenses these variables into a smaller latent representation. Ideally, between 10 and 50 latent variables are kept: enough to capture the variability in the data, and few enough to cluster upon efficiently.
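The idea can be sketched as follows, assuming NumPy is available. The toy data, the choice of PCA (via SVD) as the dimension reduction, and the bare-bones 2-means clustering are all illustrative assumptions, not the method of any particular tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 60 cells x 1000 genes, two cell types that differ
# in the expression of the first 200 genes.
n_cells, n_genes, n_components = 60, 1000, 10
expression = rng.normal(size=(n_cells, n_genes))
expression[:30, :200] += 3.0

# Dimension reduction: PCA via SVD of the gene-centred matrix,
# keeping only the top components as the latent representation.
centred = expression - expression.mean(axis=0)
u, s, _ = np.linalg.svd(centred, full_matrices=False)
latent = u[:, :n_components] * s[:n_components]  # cells x components

# Clustering on the latent variables: a minimal 2-means loop, seeded with
# the two cells at the extremes of the first component.
first = latent[:, 0]
centres = latent[[first.argmin(), first.argmax()]].copy()
for _ in range(10):
    distances = np.linalg.norm(latent[:, None, :] - centres[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    centres = np.array([latent[labels == k].mean(axis=0) for k in range(2)])

print("Clustered on", latent.shape[1], "latent variables instead of", n_genes, "genes")
print("Cluster sizes:", np.bincount(labels))
```

The clustering step only ever sees a 60 x 10 matrix, which is where the computational saving comes from.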
Tip: What exactly is a ‘Gene profile’?
Think of it like a fingerprint that some cells exhibit and others don’t. It’s a small collection of genes which are up- or down-regulated in relation to one another. Their differences are not absolute, but relative. So if CellA has 100 counts of Gene1 and 50 counts of Gene2, this creates a relation of 2:1 between Gene1 and Gene2. If CellB has 20 counts of Gene1 and 10 counts of Gene2, then it shares the same relation. If CellA and CellB also share relations between other genes, then this might be enough to say that they share a Gene profile, and they will therefore likely cluster together as they describe the same cell type.
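The CellA and CellB example can be written out as a small sketch. The helper name `relative_profile` is hypothetical, and real analyses compare many more genes and use more robust normalisation.

```python
def relative_profile(counts):
    """Convert raw counts into fractions of the cell's total counts."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


cell_a = {"Gene1": 100, "Gene2": 50}
cell_b = {"Gene1": 20, "Gene2": 10}

profile_a = relative_profile(cell_a)
profile_b = relative_profile(cell_b)

# The absolute counts differ five-fold, but the relative profiles agree.
same_profile = all(abs(profile_a[g] - profile_b[g]) < 1e-9 for g in profile_a)
print("CellA profile:", profile_a)
print("CellB profile:", profile_b)
print("Same relative profile:", same_profile)
```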
Tip: Why do we only consider highly variable genes?
The non-variable genes are likely housekeeping genes, which are expressed everywhere and are not so useful for distinguishing one cell type from another. However, these background genes are still important to the analysis: they are used to generate a baseline model against which the variability of the other genes is measured.
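As a rough illustration, the sketch below ranks genes by a simple dispersion measure (variance divided by mean) across cells. The gene names and counts are invented, and ranking by raw dispersion is a simplifying assumption: real tools typically fit a baseline model of variability against mean expression rather than ranking directly.

```python
from statistics import mean, pvariance

# Hypothetical counts for four genes across six cells.
counts = {
    "Housekeeping1": [50, 52, 49, 51, 50, 48],
    "Housekeeping2": [200, 195, 205, 198, 202, 200],
    "MarkerA": [0, 1, 0, 40, 38, 45],
    "MarkerB": [30, 28, 33, 0, 2, 1],
}


def dispersion(values):
    """Variance-to-mean ratio; 0.0 for genes that are never expressed."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


ranked = sorted(counts, key=lambda gene: dispersion(counts[gene]), reverse=True)
highly_variable = ranked[:2]

for gene in ranked:
    print(f"{gene}: dispersion {dispersion(counts[gene]):.2f}")
print("Highly variable genes:", highly_variable)
```

The housekeeping genes have high counts but barely change between cells, so they rank last, while the marker genes separate the first three cells from the last three.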
Tip: Why is amplification more of an issue in scRNA-seq than RNA-seq?
Because the amount of starting material is extremely small, the initial amplification is likely to be uneven: the products of the first amplification cycle are overrepresented in the second cycle, which leads to further bias. In bulk RNA-seq, the larger pool of RNA molecules to amplify evens out the odds that any one transcript will be amplified more than others.
Tip: Can RNA-seq techniques be applied to scRNA-seq?
The short answer is ‘no, but yes’. Initially this was not possible, because the over-prevalence of dropout events (“zeroes”) in the data complicated the normalisation techniques, but with newer methods this is much less of a problem.
Tip: Could I use a different p-adj value for filtering differentially expressed genes?
Yes, you can modify this value, either to perform a more rigorous analysis or to extend the range of genes selected. A higher p-adj threshold will increase the number of genes selected, at the expense of including possible false positives.
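The effect of the threshold can be seen in a minimal sketch. The gene names, p-adj values and thresholds below are invented for illustration and do not come from the tutorial's results.

```python
# Hypothetical (gene, adjusted p-value) pairs from a differential expression test.
results = [
    ("GeneA", 0.0004),
    ("GeneB", 0.008),
    ("GeneC", 0.03),
    ("GeneD", 0.07),
    ("GeneE", 0.4),
]


def significant_genes(results, padj_threshold):
    """Keep the genes whose adjusted p-value is below the threshold."""
    return [gene for gene, padj in results if padj < padj_threshold]


strict = significant_genes(results, padj_threshold=0.01)
moderate = significant_genes(results, padj_threshold=0.05)
relaxed = significant_genes(results, padj_threshold=0.1)

print("p-adj < 0.01:", strict)
print("p-adj < 0.05:", moderate)
print("p-adj < 0.1: ", relaxed)
```

Each step up in the threshold admits more genes, and the newly admitted genes are the ones with the weakest statistical support.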
Tip: Could I use a different tool for performing the quantification step?
There are some alternatives to Salmon for reference transcriptome-based RNA quantification. Kallisto and Sailfish use similar alignment-free approaches; Kallisto’s is known as pseudoalignment.