Frequently Asked Questions

DESeq2


Regarding DESeq2, in the tutorial you used the normalised count table. Some people use VST-normalised or rlog-normalised counts for visualisation (heatmaps). Would you recommend that? Second question, regarding heatmap2: I think this depends on the data you analyse, but do you have any advice on how to select the clustering method and the distance method?

This depends on what you would like to do with the table. The DESeq2 wrapper in Galaxy can output all of these, and the DESeq2 vignette has a good discussion of this topic.



RNA-seq


I have RNA-seq data from a pilot experiment on differential expression in a host-associated bacterium. One dataset was obtained from a bacterial culture, but the other comes from bacteria obtained from the host (a plant). I expect strong contamination of the second sample with host RNA reads. Should I filter out the host reads before performing the analysis (and if so, what tools could I use for that), or can I just ignore the contamination (since I will map the reads to the bacterial genome, which should disregard any host-associated reads)?

You could map both sets of reads to the bacterial reference genome. You are right: no pre-filtering is required, because the host reads shouldn’t map to the reference and will be excluded.



Estimation of strandedness


Is Infer Experiment ever used in practice? I mean, most often you already know whether the RNA-seq data is stranded, right, because you sequenced it yourself or ordered it from a company.

It is useful in cases where you get the data from someone else and they don’t know whether the library is stranded.

I am trying to check the strandedness of my libraries and I get unequal numbers in the Infer Experiment output, but in IGV the data looks unstranded. What does this mean?

Elimination of the second strand is often imperfect, and there are genuine cases of bidirectional transcription in the genome. Even so, a 70/30 % split, as in your report, is not a good result for a stranded library; it likely means that the library preparation didn’t work perfectly. You can treat this as a stranded library in your analysis, but you couldn’t conclude, for instance, that a given gene is actually transcribed from the reverse strand. The outcome can depend on many factors; one is that you need to completely digest the DNA with a high-quality DNase before doing the reverse transcription.
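As a rough illustration of how the two fractions reported by Infer Experiment might be read, here is a minimal Python sketch. The function name and both thresholds (`stranded_cutoff`, `unstranded_margin`) are assumptions made for this example, not values taken from the tool or from this FAQ; adjust them to your own judgement.

```python
def classify_strandedness(frac_sense, frac_antisense,
                          stranded_cutoff=0.9, unstranded_margin=0.1):
    """Label a library from the two 'fraction of reads explained by' values.

    frac_sense / frac_antisense: the two fractions from the Infer Experiment
    report (reads agreeing with / opposing the annotated gene strand).
    The cutoffs are illustrative assumptions, not tool defaults.
    """
    determined = frac_sense + frac_antisense
    if determined <= 0:
        raise ValueError("no reads with a determined strand")

    # Share of the strand-determined reads that match the gene strand.
    sense_share = frac_sense / determined

    if sense_share >= stranded_cutoff:
        return "stranded (sense)"
    if sense_share <= 1 - stranded_cutoff:
        return "stranded (antisense)"
    if abs(sense_share - 0.5) <= unstranded_margin:
        return "unstranded"
    # Neither clearly stranded nor clearly unstranded, e.g. a 70/30 split.
    return "ambiguous"


print(classify_strandedness(0.70, 0.30))  # ambiguous
print(classify_strandedness(0.50, 0.50))  # unstranded
```

With these assumed cutoffs, the 70/30 split from the question falls into the "ambiguous" band, which matches the answer above: the library is not cleanly stranded, so strand-specific conclusions should be drawn with care.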



Mapping


Is it possible to visualize the RNA STAR BAM file using the JBrowse tool?

Yes, that should work.

I am using the RNA STAR tool to map my reads to the reference genome. While filling in the details, I have to set "Length of the genomic sequence around annotated junctions", which apparently has to be 36. I'm lost as to why this is 36, what it means, and why it is relevant. Does anyone have any ideas? Why should it be the length of the reads - 1?

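RNA STAR uses the gene model to build a database of splice junctions, and the genomic sequence kept on each side of a junction doesn’t “need” to be longer than the reads (37 bp). A read that spans a junction has at least one base on one side of it, so at most 36 bases (read length - 1) can fall on the other side.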
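The "read length - 1" rule can be checked with a small Python sketch. The function name is hypothetical and this is not part of RNA STAR; it only enumerates the ways a read can straddle a junction and reports the longest stretch that can end up on one side.

```python
def junction_overhang(read_length):
    """Longest stretch of a read that can lie on one side of a splice junction.

    A read spanning a junction needs at least one base on each side, so we
    enumerate every possible split and take the longer side of each.
    """
    if read_length < 2:
        raise ValueError("a read needs at least 2 bases to span a junction")

    # (bases left of the junction, bases right of the junction)
    splits = [(left, read_length - left) for left in range(1, read_length)]
    return max(max(left, right) for left, right in splits)


print(junction_overhang(37))  # 36
```

For the 37 bp reads in the question this gives 36, the value entered in the tool form.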