The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) system is a bacterial immune system that has been modified for genome engineering. This groundbreaking technology resulted in a Nobel Prize for Emmanuelle Charpentier and Jennifer Doudna in 2020 (Uyhazi and Bennett 2021). CRISPR consists of two components: a guide RNA (gRNA) and a non-specific CRISPR-associated endonuclease (Cas9). The gRNA is a short synthetic RNA composed of a scaffold sequence necessary for Cas9-binding (trRNA) and ~20 nucleotide spacer or targeting sequence which defines the genomic target to be modified (crRNA). Cas9 induces double-stranded breaks (DSB) within the target DNA. The resulting DSB is then repaired by either error-prone Non-Homologous End Joining (NHEJ) pathway or less efficient but high-fidelity Homology Directed Repair (HDR) pathway. The NHEJ pathway is the most active repair mechanism and it leads to small nucleotide insertions or deletions (indels) at the DSB site. This results in in-frame amino acid deletions, insertions or frameshift mutations leading to premature stop codons within the open reading frame (ORF) of the targeted gene. Ideally, the end result is a loss-of-function mutation within the targeted gene; however, the strength of the knockout phenotype for a given mutant cell is ultimately determined by the amount of residual gene function.
The ease of generating gRNAs makes CRISPR one of the most scalable genome editing technologies and it has been recently utilized for genome-wide screens. These screens enable systematic targeting of 1000s of genes, with one gene targeted per cell, to identify genes driving phenotypes, such as cell survival, drug resistance or sensitivity. It is feasible for any laboratory to perform a CRISPR screen (Cluse et al. 2018) and they are being increasingly used to obtain biological insight (Bock et al. 2022, Przybyla and Gilbert 2021). These days, pooled whole-genome knockout, inhibition and activation CRISPR libraries and CRISPR sub-library pools are commonly screened.
Here we will demonstrate analysing CRISPR screen using data from Fujihara et al. 2020. There are 3 samples from the human esophageal cancer cell line (OACM5.1): a baseline sample taken at time zero (T0-Control), a sample treated with drug for 8 days (T8-APR-246) and a control sample treated with vehicle for 8 days (T8-Vehicle). The aim is to identify genes whose knockout increases the cancer cells sensitivity to the drug. We will use FASTQ files containing 1% of reads from the original samples to demonstrate the read processing steps.
Hands-on: Retrieve CRISPR screen fastq datasets
Create a new history for this tutorial
Import the files from Zenodo:
Open the file galaxy-uploadupload menu
Click on Rule-based tab
“Upload data as”: Collection(s)
Copy the following tabular data, paste it into the textbox and press Build
From Rules menu select Add / Modify Column Definitions
Click Add Definition button and select List Identifier(s): column A
Then you’ve chosen to upload as a ‘dataset’ and not a ‘collection’. Close the upload menu, and restart the process, making sure you check Upload data as: Collection(s)
Click Add Definition button and select URL: column B
Click Apply
In the Name: box type fastqs and press Upload
Raw reads QC
First we’ll check the quality of the raw read sequences with FastQC and aggregate the reports from the multiple samples with MultiQC (Ewels et al. 2016). We will check if the base quality is good and for presence of adapters. For more details on quality control and what the other FastQC plots mean see the “Quality control” tutorial.
With CRISPR screens we expect adapter sequence to be present, surrounding the guide. The adapters for the Brunello library vector used with this dataset (lentiGuide-Puro) are shown below. We’ll get FASTQC to check for their presence by providing an adapters file with their sequences.
Figure 2: Reads from the T8-APR-246 sample with one of the guide sequences highlighted in blue (controlguide1 from Brunello library file). The adapter sequences are in red, directly adjacent to the guide on the right and left. The bases in purple at the start of the reads are stagger regions of different length (0-8bp), required to maintain diversity across the flowcell during sequencing of CRISPR screens.
The adapters file (adapter_list.tsv) we use here with FASTQC was created by adding the two CRISPR adapter sequences (5’ and 3’ of the guide) to the bottom of the adapters file used by FASTQC. The first 12 bases of the CRISPR adapters were used, as that length is currently recommended in the FASTQC adapters file.
Hands-on: Quality control
Import the adapters file from Zenodo or the Shared Data library (if available):
Click galaxy-uploadUpload Data at the top of the tool panel
Select galaxy-wf-editPaste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
FastQC ( Galaxy version 0.73+galaxy0) with the following parameters:
param-collection“Short read data from your current history”: fastqs (click “Dataset collection” button on left-side of this input field)
param-file“Adapter list”: adapter_list.tsv
MultiQC ( Galaxy version 1.11+galaxy0) with the following parameters to aggregate the FastQC reports:
In “Results”
“Which tool was used generate logs?”: FastQC
In “FastQC output”
“Type of FastQC output?”: Raw data
param-collection“FastQC output”: Raw data files (output of FastQC)
Add a tag (#fastqc-untrimmed) to the MultiQC webpage to make it easy to differentiate this dataset in the History from other MultiQC reports we will generate.
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
Click on the dataset to expand it
Click on Add Tagsgalaxy-tags
Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
Press Enter
Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):
a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
dataset 3 is used to calculate read coverage using BedTools Genome Coverageseparately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
What do you think of the base quality of the sequences?
Are adapters detected?
The read length is 75 bp.
The base quality is good for the 3 files, as can be seen in the plot below.
Yes. In the Adapter Content plot (below) we can see the two CRISPR adapters have been detected. The adapter 5 prime of the sgRNA starts appearing at the beginning of the reads and has been detected in nearly 100% of reads. The adapter 3 prime of the sgRNA appears after base 40. There’s also a little Illumina Universal adapter detected towards the ends of some reads which indicate the presence of short fragments, possibly adapter dimers.
Trim adapters
We’ll trim the adapters from these sequences using Cutadapt (Marcel 2011). To trim, we’ll use the 5’ adapter sequence. We don’t need to trim the 3’ adapter as Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) will only use the first 20bp from each read. It determines the number of bases to use automatically from the length of the sequences in the library file, in our case 20bp, or we can specify the sgRNA length.
In this dataset the adapters start at different positions in the reads, as was shown above. MAGeCK count can trim adapters around the guide sequences. However, the adapters need to start at the same position in each read, requiring the same trimming length, as described on the MAGeCK website. An example for what MAGeCK expects is shown below. If you used MAGeCK count trimming with the dataset in this tutorial it wouldn’t be able to trim the 5’ adapter properly and you would only get ~60% reads mapping instead of >80%.
Figure 3: Example showing what MAGeCK count expects to be able to auto-detect and trim adapters. Guide sequence is higlighted in blue with the adapter sequences directly adjacent on the right and left. MAGeCK count uses the first 20 bases of each read to map so the sequence after the guide is less important to trim exactly.
For this dataset, as the adapters are not at the same position in each read, we need to trim the adapters before running MAGeCK count.
Hands-on: Trim adapters
Cutadapt ( Galaxy version 3.5+galaxy0) with the following parameters:
“Single-end or Paired-end reads?”: Single-end
param-collection“FASTQ/A file #1”: all fastq.gz files
MultiQC ( Galaxy version 1.11+galaxy0) with the following parameters to aggregate the Cutadapt reports:
In “Results”
“Which tool was used generate logs?”: Cutadapt/Trim Galore!
param-collection“Output of Cutadapt”: Report files (output of Cutadapt)
Add a tag (#cutadapt-report) to the MultiQC webpage to differentiate this report from the previous (FastQC) one
Inspect the MultiQC report
MultiQC produces a 5’ trimmed sequences plot where we can check the results are as expected. Here we can see the length of sequence trimmed from the start of the read ranges from 22bp to 30bp with a dip at 27bp. This corresponds to the length of the adapter (22bp) plus stagger sequence (0,1,2,3,4,6,7,8bp) (see sequencing protocol). There is no 5bp stagger sequence so 27bp sequences are not expected to be trimmed. The trimmed sequence lengths are what we expect for this dataset and the plot looks similar for all our samples which is good.
Hands-on: Exercise: Quality control of the polished datasets
Use FastQC ( Galaxy version 0.73+galaxy0) and MultiQC ( Galaxy version 1.11+galaxy0) like before, but using the the trimmed datasets produced by Cutadapt as input.
Question
How did read trimming affect the Adapter Content plot?
In the Adapter Content section we now don’t have any 5’ adapter detected. The first 20bp of the reads, what MAGeCK will use for our dataset, has no adapter detected.
Counting
For the rest of the CRISPR screen analysis, counting and testing, we’ll use the tool called MAGeCK (Li et al. 2014, Li et al. 2015).
To count how many guides we have for each gene, we need a library file that tells us which guide sequence belongs to which gene. The guides used here are from the Brunello library (Doench et al. 2016) which contains 77,441 sgRNAs, an average of 4 sgRNAs per gene, and 1000 non-targeting control sgRNAs. The library file must be tab-separated and contain no spaces within the gene or target names. If necessary, there are tools in Galaxy that can format the file removing spaces and converting commas to tabs.
MAGeCK count ( Galaxy version 0.5.9.2.4) with the following parameters:
“Reads Files or Count Table?”: Separate Reads files
param-collection“Sample reads”: the Read 1 Output (outputs of Cutadapt)
param-file“sgRNA library file”: the brunello.tsv file
In “Output Options”:
“Output Count Summary file”: Yes
“Output plots”: Yes
We have been using 1% of reads from the samples. Import the MAGeCK count files (sgRNA counts, counts summary and plots pdf) for the full dataset so you can see what results for a real dataset looks like.
The contents of the count summary file is explained on the MAGeCK website, also shown below. The columns are as follows. To help you evaluate the quality of the data, recommended values from the MAGeCK authors are shown in bold.
Column
Content
File
The fastq (or the count table) file used.
Label
The label of that fastq file assigned.
Reads
Total number reads in the fastq file. (Recommended: 100~300 times the number of sgRNAs)
Mapped
Total number of reads that can be mapped to library
Percentage
Mapped percentage, calculated as Mapped/Reads (Recommended: at least 60%)
TotalsgRNAs
Total number of sgRNAs in the library
Zerocounts
Total number of missing sgRNAs (sgRNAs that have 0 counts) (Recommended: no more than 1%)
GiniIndex
The Gini Index of the read count distribution. A smaller value indicates more eveness of the count distribution. (Recommended: around 0.1 for plasmid or initial state samples, and around 0.2-0.3 for negative selection samples)
Gini index is a measure of inequality from economics. It is used in CRISPR analysis to assess if sgRNAs are present in equal amounts. In positive selection experiments, where only some sgRNAs dominate, the index can be high. However, as discussed in Li et al. 2015, in plasmid library, in early time points, or negative selection experiments, we expect fairly even
distribution of the remaining sgRNAs that haven’t been negatively selected. A high Gini index in these types of sample can indicate CRISPR oligonucleotide synthesis unevenness, low viral transfection efficiency, and overselection, respectively.
Question
Is the data quality good for the 3 samples? Use the count summary file for the full dataset, and the recommended values in the table above, to answer these questions.
Have we sequenced enough reads?
Is the mapped percentage good?
Is the sgRNA zero count value good?
Is the Gini Index good?
The number of reads is ok. For example, for T0 control sample we have 17,272,052 reads mapped to guides. We have 77,441 guides so we have ~220 reads per guide (17,272,052/77,441). A minimum of 100 reads per guide, preferably 300, is recommended.
Yes, in the summary we have >85% mapped for all 3 samples. MAGeCK count does not allow any base mismatches between the reads and the library file, as described so we expect not all reads will map. Note that we filtered out (6-10%) reads with Cutadapt so we should include those in our unmapped % if we want an accurate count.
T0-Control has 0.71% (546/77441 * 100) sgRNAs that have no reads mapped, which is good. The T8 samples are just slightly high at 2.3% (1752/77441 * 100) and 2.8% (2170/77441 * 100).
The Gini Index is 0.09 for T0-Control (initial state) which is good. The T8 samples are higher at 0.13 and 0.14 but good for a negative selection experiment.
MAGeCK count can also generate a PDF with plots that can help assess quality.
Hands-on: Assess mageck count plots
Inspect the PDF we imported above.
Question
What can you determine about the samples from the plots?
In the boxplots, we can see we have largely similiar distributions of counts for the 3 samples. The greater length of the box and between whiskers in the T8 samples compared to the control tells us we have a bit more variability of counts in those samples, more guides with low and high counts in T8 compared to the T0.
The Distribution of read counts plot shows us how many guides we have for each count (the frequency sums to the total no. of guides 77,441). Similar to the boxplots, the wider distribution for T8 compared to T0 shows us that those samples have more guides with low and high counts. The peak for T0 appears lower just because it has more points (bins) in the plot.
The PCA plot shows us the samples from the different conditions separate well. The T8 samples are a bit more similar to each other, to each other on PC1 axis than to T0. If we had more samples we could use this plot to check for clustering of replicates, batch effect or outliers.
The hierarchical clustering plot also shows us that the T8 samples are a bit more similar to each other than to T0.
The paper by Li et al. 2015 has more information on MAGeCK quality control.
Testing
Now that we’ve generated our guide counts, we’ll use MAGeCK test to identify essential genes. Essential means positively or negatively selected sgRNAs and genes. CRISPR positive or negative selection screens can be performed. With a positive selection screen, most cells die after the treatment (selection) and we are interested in identifying genes whose sgRNAs increase and dominate, indicating loss of those genes helps cells survive that treatment. This can help identify genes essential for drug resistance. With a negative selection screen, most cells survive after the treatment. In that case, we are interested in identifying genes whose sgRNAs decrease (drop out) compared to a control (e.g. vehicle), indicating those genes are needed for the cells to survive with that treatment. This can help identify genes essential for drug sensitivity. Regardless of the type of screen performed (positive or negative), MAGeCK can identify both positively and negatively selected genes in the screen (Li et al. 2014). The dataset we are using in this tutorial is from a negative selection screen where the aim is to identify genes whose knockout increases the cancer cells sensitivity to the drug.
If we want to compare the drug treatment (T8-APR-246) to the vehicle control (T8-Vehicle) we can use MAGeCK test. MAGeCK test uses a robust ranking aggregation (RRA) algorithm (Li et al. 2014).
Figure 5: Overview of the MAGeCK algorithm. Raw read counts corresponding to single-guided RNAs (sgRNAs) from different experiments are first normalized using median normalization and mean-variance modeling is used to capture the relationship of mean and variance in replicates. The statistical significance of each sgRNA is calculated using the learned mean-variance model. Essential genes (both positively and negatively selected) are then identified by looking for genes whose sgRNAs are ranked consistently higher (by significance) using robust rank aggregation (RRA) (from Li et al. 2014)
Hands-on: Test for enrichment
MAGeCKs test ( Galaxy version 0.5.9.2.1) with the following parameters:
param-file“Counts file”: kenji_mageck_sgRNA_counts.tsv(the counts file for the full dataset that we imported)
“Specify Treated samples or Control”: Treated samples
“Treated Sample Labels (or Indexes)”: 0
“Control Sample Labels (or Indexes)”: 1
In “Output Options”:
“Output normalized counts file”: Yes
“Output plots”: Yes
For MAGeCK test we could specify the samples using their names, which must match the names used in the columns of the counts file, but the hyphens we have in our sample names aren’t allowed. We could change the hyphens in the counts file or we can specify the samples by their positions in the counts file, with the first sample column being 0. We do that here.
Comment: Replicates
If we have biological and/or technical replicates we can handle them in a similar way to that described on the MAGeCK website.
For biological replicates, we input them in MAGeCK test Treated Sample Labels/Control Sample Labels fields separated by a comma.
For technical replicates, we could combine the fastqs for each sample/biological replicate, for example with the Concatenate datasets tool, before running MAGeCK count.
We are using MAGeCK’s default normalization method “median” which is more robust to outliers.
Figure M1 from Li et al. 2014 shows a comparison of median (“median”) versus total (“total”) normalization for two CRISPR screen datasets.
The distribution of the read counts of significant sgRNAs (FDR=1%) was compared with the mean read count distribution of all sgRNAs (“all”, black). The distribution of the significant sgRNAs should be similar to the distribution of all sgRNAs if the normalization method is unbiased. The difference is small for the leukemia dataset. However, in the melanoma dataset, where a few sgRNAs have very large read counts, the difference is larger, as “total” normalization will prefer sgRNAs with higher read-counts. In contrast, the distribution after “median” normalization is closer to the distribution of all sgRNAs.
If many genes have zero counts, such as after treatment in a positive selection screen, the median may be zero and we may get a message in the MAGeCK log that we have to use total normalization.
Instead of median or total, we could choose to normalize using control guide sgRNAs. However, we would need to know that they’re not changing due to the experiment or introducing bias (Chen et al. 2018).
MAGeCK test outputs:
a Gene Summary file
a sgRNA Summary file
a PDF report
Gene Summary file
The Gene Summary file contains the columns described below and a row for each gene targeted by sgRNAs. We have >20,000 genes in the file for this dataset.
We get values for both negative and positive selection. The dataset here is from a negative selection screen so we are most interested in the negative values.
Genes are ranked by the p.neg field (by default). If you need a ranking by the p.pos, you can use the Sort data in ascending or descending order tool in Galaxy.
Column name
Content
id
Gene ID
num
The number of targeting sgRNAs for each gene
neg|score
The RRA lo value of this gene in negative selection
neg|p-value
The raw p-value (using permutation) of this gene in negative selection
neg|fdr
The false discovery rate of this gene in negative selection
neg|rank
The ranking of this gene in negative selection
neg|goodsgrna
The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the –gene-test-fdr-threshold option), in negative selection.
neg|lfc
The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the –gene-lfc-method option
pos|score
The RRA lo value of this gene in positive selection
pos|p-value
The raw p-value (using permutation) of this gene in positive selection
pos|fdr
The false discovery rate of this gene in positive selection
pos|rank
The ranking of this gene in positive selection
pos|goodsgrna
The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the –gene-test-fdr-threshold option), in positive selection.
pos|lfc
The log fold change of this gene in positive selection
sgRNA summary file
The sgRNA Summary file contains the columns described below. We can use the sgRNA file to check how the individual guides for genes of interest performed.
Column
Content
sgrna
sgRNA ID
Gene
The targeting gene
control_count
Normalized read counts in control samples
treatment_count
Normalized read counts in treatment samples
control_mean
Median read counts in control samples
treat_mean
Median read counts in treatment samples
LFC
The log2 fold change of sgRNA
control_var
The raw variance in control samples
adj_var
The adjusted variance in control samples
score
The score of this sgRNA
p.low
p-value (lower tail)
p.high
p-value (higher tail)
p.twosided
p-value (two sided)
FDR
false discovery rate
high_in_treatment
Whether the abundance is higher in treatment samples
PDF report
The PDF shows plots of the top 10 negatively and positively selected genes.
We can see the top genes ranked by RRA scores or p value. These values come from the gene summary file.
Question
What are the top 3 negatively selected genes by P value?
ESD, MTHFD1L and SHMT2, the same as by RRA score. RRA score is like p-value. Those genes are part of the glutathione pathway which was found to be altered in the published paper for this dataset.
The PDF also shows plots with the sgRNA counts for the top 10 genes. These values are the normalized counts for each sgRNA from the sgRNA summary file. With these plots we can see if the counts of all the sgRNAs for these top genes are changing similarly.
Question
Are all the sgRNAs for FLI1 (the 4th top neg gene) changing similarly?
In the gene summary table, how many of the negatively selected genes have FDR < 0.05?
No. We can see in this case that, while one sgRNA is a lot lower in the APR treated sample compared to the vehicle, one increases a little, and the other two sgRNAs don’t change much. So we might conclude that this gene is not strongly negatively selected.
In addition to the visualisations automatically generated by MAGeCK in the PDF, we can create a volcano plot to further visualise the output. With a volcano plot we plot the magnitude of change for drug treatment versus vehicle control (lfc) versus significance (p-value). As we have two columns for lfc and p-value, one for negative selection and one for positive, we first combine these into one column for each using the awk tool. If the neg|p-value is smaller than the pos|p-value the gene is negatively selected. If the neg|p-value is larger than the pos|p-value the gene is positively selected. Then we create the plot using the Volcano plot tool.
Hands-on: Create volcano plot
Text reformatting with awk ( Galaxy version 1.1.2) with the following parameters:
param-file“File to process”: MAGeCK test Gene Summary
“AWK Program”: Copy and paste the text in the grey box below into this field
# Print new header for first line
NR == 1 { print "gene", "pval", "fdr", "lfc" }
# Only process lines after first
NR > 1 {
# check if neg pval (column 4) is less than pos pval (column 10)
if ($4 < $10){
# if it is, print negative selection values
print $1, $4, $5, $8
} else {
# if it's not, print positive selection values
print $1, $10, $11, $14
}
}
Inspect the file output. It should look like below.
Volcano Plot ( Galaxy version 0.0.5) to create a volcano plot
param-file“Specify an input file”: the Text reformatting output file
param-file“File has header?”: Yes
param-select“FDR (adjusted P value)”: Column 3
param-select“P value (raw)”: Column 2
param-select“Log Fold Change”: Column 4
param-select“Labels”: Column 1
param-select“Points to label”: Significant
param-text“Only label top most significant”: 10
Inspect the plot in the PDF output.
Question
What is the most significant gene?
ATP5E as it is the gene nearest the top of the plot.
For more details on using the volcano plot tool, see the tutorial here. For how to customise the volcano plot tool output using R, see the tutorial here.
Pathway analysis
We can perform pathway analysis on the results to identify pathways that are changing with the treatment. MAGeCK has a pathway analysis module, however, we will use the tool fgsea as it outputs some visualisations and the genes enriched in the pathway. For fgsea we need a ranked list of genes and a pathways file in GMT format. We can get the ranked list from our mageck results, we’ll use the RRA score column for negative results. For the pathways, we’ll use the Hallmark pathways from MSigDB, which can be a good starting point for pathway exploration.
Hands-on: Perform gene set enrichment with fgsea
Import the Hallmark pathways file Zenodo or the Shared Data library (if available). Set the Type to tabular:
It takes about 30 mins to run mageck mle on the samples below so this part of the tutorial is optional, you can try it if time allows or in your own time.
If we have more than two conditions to compare, or a complex experimental design, we can use MAGeCK mle. MAGeCK mle uses a maximum likelihood estimation (MLE) algorithm (Li et al. 2015). It outputs a single value (beta score) per gene instead of a score for both negative and positive selection. A negative beta score indicates negative selection and a positive indicates positive selection. MAGeCK mle can also be used for comparing 2 conditions instead of MAGeCK test (RRA) but it is slower.
To demonstrate using MAGeCK mle, we will compare the drug treatment (T8-APR-246) to T0 and the vehicle (T8-Vehicle) to T0.
We’ll use a design matrix file.
The rules of the design matrix are:
The design matrix file must include a header line of condition labels
The first column is the sample labels that must match labels in read count file
The second column must be a “baseline” column that sets all values to “1”
The value in the design matrix is either “0” or “1”
You must have at least one sample of “initial state” (e.g., day 0 or plasmid) that has only one “1” in the corresponding row. That only “1” must be in the baseline column
The initial state (baseline) sample must be the 1st sample listed, under the header row, for mageck mle
The file for our samples looks like below. We have 3 conditions: baseline (T0), Vehicle and APR. The 1st row indicates that T0 is our baseline sample, the 2nd row indicates that we want to compare Vehicle to T0, and the 3rd row indicates that we want to compare APR to T0.
The beta score of this gene in condition (e.g. Vehicle or APR). The conditions are specified in the design matrix input to mle.
Condition_name|p-value
The raw p-value (using permutation) of this gene
Condition_name|fdr
The false discovery rate of this gene
Condition_name|z
The z-score associated with Wald test
Condition_name|wald-p-value
The p value using Wald test
Condition_name|wald-fdr
The false discovery rate of the Wald test
Similar to what we did with the MAGeCK test output, we can create a volcano plot to visualise the results, showing the magnitude of change (beta score) and significance (p-value).
Hands-on: Visualize MLE results
Create a volcano plot to visualise the result for APR vs T0. Use the gene summary file columns GeneAPR|betaAPR|wald-p-valueAPR|wald-fdr
Volcano Plot ( Galaxy version 0.0.5) to create a volcano plot
param-file“Specify an input file”: the MAGeCK mle Gene Summary file
param-select“FDR (adjusted P value)”: Column 14
param-select“P value (raw)”: Column 13
param-select“Log Fold Change”: Column 9
param-select“Labels”: Column 1
param-select“Points to label”: Significant
param-text“Only label top most significant”: 10
Similarly, you could create a plot for the vehicle vs T0 using the gene summary file columns GeneVehicle|betaVehicle|wald-p-valueVehicle|wald-fdr
CRISPR Screen reads can be assessed for quality using standard sequencing tools such as FASTQC, MultiQC and trimmed of adapters using Cutadapt. The detection of enriched guides can be performed using MAGeCK. Downstream analysis can include visualisations, such as volcano plot, and pathway analysis with tools like fgsea.
Acknowledgements
Thanks to Mehmet Tekman for suggesting the awk tool. Thanks also to Jennifer Devlin, Lydia Lim and Sylvia Mahara for comments and feedback on the tutorial.
You've Finished the Tutorial
Please also consider filling out the Feedback Form as well!
Key points
CRISPR screen data can be analysed using MAGeCK and standard read quality tools
Li, W., H. Xu, T. Xiao, L. Cong, M. I. Love et al., 2014 MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. 15: 10.1186/s13059-014-0554-4
Li, W., J. Köster, H. Xu, C.-H. Chen, T. Xiao et al., 2015 Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR. 16: 10.1186/s13059-015-0843-6
Doench, J. G., N. Fusi, M. Sullender, M. Hegde, E. W. Vaimberg et al., 2016 Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology 34: 184–191. 10.1038/nbt.3437
Joung, J., S. Konermann, J. S. Gootenberg, O. O. Abudayyeh, R. J. Platt et al., 2016 Protocol: Genome-scale CRISPR-Cas9 Knockout and Transcriptional Activation Screening. 10.1101/059626
Chen, S., Y. Zhou, Y. Chen, and J. Gu, 2018 fastp: an ultra-fast all-in-one FASTQ preprocessor. 10.1093/bioinformatics/bty560
Cluse, L. A., I. Nikolic, D. Knight, P. B. Madhamshettiwar, J. Luu et al., 2018 A Comprehensive Protocol Resource for Performing Pooled shRNA and CRISPR Screens, pp. 201–227 inMethods in Molecular Biology, Springer New York. 10.1007/978-1-4939-7568-6_17
Fujihara, K. M., B. Zhang, T. D. Jackson, B. Nijiagel, C.-S. Ang et al., 2020 Genome-wide CRISPR screens reveal APR-246 (Eprenetapopt) triggers ferroptosis and inhibits iron-sulfur cluster biogenesis. 10.1101/2020.11.29.398867
Przybyla, L., and L. A. Gilbert, 2021 A new era in functional genomics screens. Nature Reviews Genetics. 10.1038/s41576-021-00409-w
Uyhazi, K. E., and J. Bennett, 2021 A CRISPR view of the 2020 Nobel Prize in Chemistry. Journal of Clinical Investigation 131: 10.1172/jci145214
Bock, C., P. Datlinger, F. Chardon, M. A. Coelho, M. B. Dong et al., 2022 High-content CRISPR screening. Nature Reviews Methods Primers 2: 10.1038/s43586-021-00093-4
Glossary
CRISPR
Clustered Regularly Interspaced Short Palindromic Repeats
MAGeCK
Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{genome-annotation-crispr-screen,
author = "Maria Doyle and Kenji Fujihara and Twishi Gulati",
title = "CRISPR screen analysis (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/crispr-screen/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Congratulations on successfully completing this tutorial!
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
4 stars:
Liked: A demo with a cripsr pool set who works (so many online pool screen CRIPSR tools do not work). so many command based installation do not work. Simply starting with MAGECK installation on python is just a nightmare, it does not install smoothly, and it seems thatthe original authors move on and do not care at deploying this smoothly to the community(only expert can use these code lines). I really appreciate the work here. It seems more applicable from any machine without all the dependancies, environments, expertise in coding and languages bug......
Disliked: Should make mageck available in global galaxy and not just australia one. Could provide other datasets such as cripsrI or cripsA pool set to extend the range of trial for users to get accustomed with the pipeline.
March 2022
5 stars:
Liked: everything because all was new for me
Disliked: Add more information in the lecture (describe more meticulously the CRISPR analysis)
5 stars:
Liked: It was super clear, and combine nice tools
Disliked: For me was not "intuitive" to have a test with two p-values (for negative/positive selection). That I do not know if it is pretty common in transcriptomic (and I missed it) or if it is non so commonly used and it was speciffically developed on MAGeCK test. If it is not commonly use I think that can be nice to include a brief description of the statistical test in the tutorial of the course.