CRISPR screen analysis

Overview

Questions:
  • What are the steps to process CRISPR screen data?

  • How to identify differentially enriched guides across multiple experimental conditions?

Objectives:
  • Check quality of raw reads

  • Trim sequencing adapters

  • Count guide sequences in samples

  • Evaluate the quality of count results

  • Test for differential enrichment of guides across conditions

Requirements:
Time estimation: 2 hours
Supporting Materials:
Last modification: Nov 22, 2021
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) Type II system is a bacterial immune system that has been modified for genome engineering. CRISPR consists of two components: a guide RNA (gRNA) and a non-specific CRISPR-associated endonuclease (Cas9). The gRNA is a short synthetic RNA composed of a scaffold sequence necessary for Cas9-binding (trRNA) and ~20 nucleotide spacer or targeting sequence which defines the genomic target to be modified (crRNA). The ease of generating gRNAs makes CRISPR one of the most scalable genome editing technologies and has been recently utilized for genome-wide screens. Cas9 induces double-stranded breaks (DSB) within the target DNA. The resulting DSB is then repaired by either error-prone Non-Homologous End Joining (NHEJ) pathway or less efficient but high-fidelity Homology Directed Repair (HDR) pathway. The NHEJ pathway is the most active repair mechanism and it results in small nucleotide insertions or deletions (indels) at the DSB site. This results in in-frame amino acid deletions, insertions or frameshift mutations leading to premature stop codons within the open reading frame (ORF) of the targeted gene. Ideally, the end result is a loss-of-function mutation within the targeted gene; however, the strength of the knockout phenotype for a given mutant cell is ultimately determined by the amount of residual gene function. These days, pooled whole-genome knockout, inhibition and activation CRISPR libraries and CRISPR sub-library pools are commonly screened.

Illustration of CRISPR Screen Method.
Figure 1: CRISPR knockout and activation methods (from Joung et al. 2016)

Agenda

In this tutorial, we will cover:

  1. Preparing the reads
    1. Data upload
    2. Raw reads QC
    3. Trim adapters
  2. Counting
  3. Testing
    1. Two conditions

Preparing the reads

Data upload

Here we will demonstrate analysing CRISPR screen using data from Fujihara et al. 2020. We will use FASTQ files containing 1% of reads from the original samples to demonstrate the read processing steps.

hands_on Hands-on: Retrieve CRISPR screen fastq datasets

  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    • Open the file galaxy-upload upload menu
    • Click on Rule-based tab
    • “Upload data as”: Collection(s)
    • Copy the following tabular data, paste it into the textbox and press Build

      T0-Control https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/T0-Control.fastq.gz
      T8-APR-246 https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/T8-APR-246.fastq.gz
      T8-Vehicle https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/T8-Vehicle.fastq.gz
      

    Rule-based Uploader.

    • From Rules menu select Add / Modify Column Definitions
      • Click Add Definition button and select List Identifier(s): column A

        tip Can’t find List Identifier?

        Then you’ve chosen to upload as a ‘dataset’ and not a ‘collection’. Close the upload menu, and restart the process, making sure you check Upload data as: Collection(s)

      • Click Add Definition button and select URL: column B

    • Click Apply
    • In the Name: box type fastqs and press Upload

    Rule-based Editor.

Raw reads QC

First we’ll check the quality of the raw read sequences with FastQC and aggregate the reports from the multiple samples with MultiQC (Ewels et al. 2016). We will check if the base quality is good and for presence of adapters. For more details on quality control and what the FastQC plots mean see the “Quality control” tutorial.

hands_on Hands-on: Quality control

  1. FastQC Tool: toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72+galaxy1 with the following parameters:
    • param-collection “Short read data from your current history”: fastqs (click “Dataset collection” button on left-side of this input field)
  2. Inspect the webpage output of FastQC for the T0 sample

    question Questions

    What is the read length?

    solution Solution

    The read length is 75 bp.

  3. MultiQC Tool: toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.9+galaxy1 with the following parameters to aggregate the FastQC reports:
    • In “Results”
      • “Which tool was used generate logs?”: FastQC
      • In “FastQC output”
        • “Type of FastQC output?”: Raw data
        • param-collection “FastQC output”: Raw data files (output of FastQC)
  4. Inspect the webpage output from MultiQC for each FASTQ

    question Questions

    What do you think of the quality of the sequences?

    solution Solution

    The quality seems good for the 3 files.

Trim adapters

We need to trim the adapters to leave just the 20bp guide sequences. We’ll trim these sequences using Cutadapt (Marcel 2011) and its linked adapter format MY_5PRIME_ADAPTER...MY_3PRIME_ADAPTER.

details Adapter trimming

In this dataset the adapters are at different positions in the reads, as shown below.

Adapter sequences dataset.
Figure 2: Reads from the T8-APR-246 sample with one of the guide sequences highlighted in blue (controlguide1 from Brunello library file). The adapter sequences are directly adjacent to the guide on the right and left.

MAGeCK count can trim adapters around the guide sequences. However, the adapters need to be at the same position in each read, requiring the same trimming length, as described on the MAGeCK website here. An example for what MAGeCK expects is shown below. If you used MAGeCK count trimming with the dataset in this tutorial it wouldn’t be able to trim the adapters properly and you would only get ~60% reads mapping instead of >80%.

Adapters MAGeCK can trim.
Figure 3: Example showing what MAGeCK count expects to be able to auto-detect and trim adapters. Guide sequence is in blue, with the adapter sequences directly adjacent on the right and left.

So for this dataset, as the adapters are not at the same position in each read, we need to trim the adapters before running MAGeCK count. To trim, we could run Cutadapt twice, first trimming the 5’ adapter sequence, then trimming the 3’ adapter. Alternatively, we can run Cutadapt just once using the linked adapter format MY_5PRIME_ADAPTER...MY_3PRIME_ADAPTER, as discussed here.

hands_on Hands-on: Trim adapters

  1. Cutadapt Tool: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4+galaxy1 with the following parameters:
    • “Single-end or Paired-end reads?”: Single-end
      • param-collection “FASTQ/A file #1”: all fastq.gz files
      • In “Read 1 Options”:
        • In “3’ (End) Adapters”:
          • param-repeat “Insert 3’ (End) Adapters”
            • “Source”: Enter custom sequence
              • “Enter custom 3’ adapter sequence”: TGTGGAAAGGACGAAACACCG...GTTTTAGAGCTAGAAATAGCAAG
    • In “Filter Options”:
      • “Minimum length (R1)”: 20
    • In “Read Modification Options”:
      • “Quality cutoff”: 20
    • “Outputs selector”:
      • “Report”: tick
  2. Inspect the report output from Cutadapt for the T8-APR sample

    question Questions

    What % of reads had adapters?

    solution Solution

    99.9%

Counting

For the rest of the CRISPR screen analysis, counting and testing, we’ll use MAGeCK (Li et al. 2014, Li et al. 2015).

To count how many guides we have for each gene, we need a library file that tells us which guide sequence belongs to which gene. The guides used here are from the Brunello library (Doench et al. 2016) which contains 77,441 sgRNAs, an average of 4 sgRNAs per gene, and 1000 non-targeting control sgRNAs. The library file must be tab-separated and contain no spaces within the target names. If necessary, there are tools in Galaxy that can format the file removing spaces and converting commas to tabs.

hands_on Hands-on: Count guides per gene

  1. Import the library file
    https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/brunello.tsv
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    • By default, Galaxy uses the URL as the name, so rename the files with a more useful name.
  2. MAGeCK count Tool: toolshed.g2.bx.psu.edu/repos/iuc/mageck_count/mageck_count/0.5.9.2.4 with the following parameters:
    • Reads Files or Count Table?”: Separate Reads files
      • param-collection “Sample reads: the Read 1 Output (outputs of Cutadapt)
    • param-file “sgRNA library file”: the brunello.tsv file
    • In “Output Options”:
      • “Output Count Summary file”: Yes
      • “Output plots”: Yes
  3. Inspect the Count Summary file

  4. We have been using 1% of reads from the samples. Import the count summary file for the full dataset so you can see what the values look like.
    https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/kenji_mageck_count_summary.tsv
    

The contents of the count summary file is explained on the MAGeCK website here, also shown below. The columns are as follows. To help you evaluate the quality of the data, recommended values are shown in bold.

Column Content
File The fastq (or the count table) file used.
Label The label of that fastq file assigned.
Reads Total number reads in the fastq file. (Recommended: 100~300 times the number of sgRNAs)
Mapped Total number of reads that can be mapped to library
Percentage Mapped percentage, calculated as Mapped/Reads (Recommended: at least 60%)
TotalsgRNAs Total number of sgRNAs in the library
Zerocounts Total number of missing sgRNAs (sgRNAs that have 0 counts) (Recommended: no more than 1%)
GiniIndex The Gini Index of the read count distribution. A smaller value indicates more eveness of the count distribution. (Recommended: around 0.1 for plasmid or initial state samples, and around 0.2-0.3 for negative selection samples )

question Questions

Is the data quality good for the 3 samples? Use the count summary file for the full dataset, and the recommended values in the table above, to answer these questions.

  1. Have we sequenced enough reads?
  2. Is the mapped percentage good?
  3. Is the sgRNA zero count value good?
  4. Is the Gini Index good?

solution Solution

  1. The number of reads is ok. The lowest number of reads we have for a sample is 17,855,968 (T8-Vehicle), we have 77,441 guides so we have ~230 reads per guide (17,855,968/77,441). A minimum of 100 reads per guide, preferably 300, is recommended.
  2. Yes, it is >80% in all 3 samples.
  3. T0-Control has 0.69% (535/77441 * 100) which is good. The T8 samples are just slightly high at 2.1% (1659/77441 * 100) and 2.7% (2102/77441 * 100).
  4. The Gini Index is 0.09 for T0-Control (initial state) which is good. The T8 samples are 0.12 and 0.13 which is good (not too high) as this is a negative selection experiment.

The paper by Li et al. 2015 has more information on MAGeCK quality control.

Testing

We have been using 1% of reads from the samples in the original dataset to save time as FASTQ files are large. As counts files are small, here we will import and use the MAGeCK counts file generated using all the reads for the samples.

Two conditions

If we want to compare the drug treatment (T8-APR-246) to the vehicle control (T8-Vehicle) we can use MAGeCK test. MAGeCK test uses a robust ranking aggregation (RRA) algorithm (Li et al. 2014).

We could specify them using their names, which must match the names used in the columns of the counts file, but hypens aren’t allowed. We can also specify by their positions in the counts file with the first sample column being 0.

hands_on Hands-on: Test for enrichment

  1. Import the count file from the full dataset Zenodo or the Shared Data library (if available):
    https://zenodo.org/api/files/646f76fa-dc6b-404a-9a74-c371a43aacd5/kenji_mageck_counts.tsv
    
  2. MAGeCKs test Tool: toolshed.g2.bx.psu.edu/repos/iuc/mageck_test/mageck_test/0.5.9.2.1 with the following parameters:
    • param-file “Counts file”: the kenji_mageck_counts.tsv file
    • “Specify Treated samples or Control”: Treated samples
      • “Treated Sample Labels (or Indexes)”: 0
    • “Control Sample Labels (or Indexes)”: 1
    • In “Output Options”:
      • “Output normalized counts file”: Yes
      • “Output plots”: Yes

    details Normalization

    We are using MAGeCK’s default normalization method “median” which is more robust to outliers. Figure M1 from Li et al. 2014 shows a comparison of median (“median”) versus total (“total”) normalization for two CRISPR screen datasets. The distribution of the read counts of significant sgRNAs (FDR=1%) was compared with the mean read count distribution of all sgRNAs (“all”, black). The distribution of the significant sgRNAs should be similar to the distribution of all sgRNAs if the normalization method is unbiased. The difference is small for the leukemia dataset. However, in the melanoma dataset, where a few sgRNAs have very large read counts, the difference is larger, as “total” normalization will prefer sgRNAs with higher read-counts. In contrast, the distribution after “median” normalization is closer to the distribution of all sgRNAs.

    Median versus Total normalization.

  3. Inspect the PDF Report output.

    question Questions

    What are the top 3 genes?

    solution Solution

    ESD, MTHFD1L and SHMT2 which are part of the glutathione pathway. That was the main pathway found to be altered in the published paper for this dataset.

MAGeCK test outputs a gene summary file that contains the columns described below.

Column number Column name Content
1 id Gene ID
2 num The number of targeting sgRNAs for each gene
3 neg|score The RRA lo value of this gene in negative selection
4 neg|p-value The raw p-value (using permutation) of this gene in negative selection
5 neg|fdr The false discovery rate of this gene in negative selection
6 neg|rank The ranking of this gene in negative selection
7 neg|goodsgrna The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the –gene-test-fdr-threshold option), in negative selection.
8 neg|lfc The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the –gene-lfc-method option
9 pos|score The RRA lo value of this gene in positive selection
10 pos|p-value The raw p-value (using permutation) of this gene in positive selection
11 pos|fdr The false discovery rate of this gene in positive selection
12 pos|rank The ranking of this gene in positive selection
13 pos|goodsgrna The number of “good” sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the –gene-test-fdr-threshold option), in positive selection.
14 pos|lfc The log fold change of this gene in positive selection

Genes are ranked by the p.neg field (by default). If you need a ranking by the p.pos, you can use the Sort data in ascending or descending order tool in Galaxy.

We can create a volcano plot to visualise the output, plotting the magnitude of change for drug treatment versus vehicle control (lfc) versus significance (p-value). As we have two columns for lfc and p-value, one for negative selection and one for positive, we first combine these into one column for each using the awk tool. If the neg|p-value is smaller than the pos|p-value the gene is negatively selected. If the neg|p-value is larger than the pos|p-value the gene is positively selected. Then we create the plot using the Volcano plot tool.

hands_on Hands-on: Create volcano plot

  1. Text reformatting Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_awk_tool/1.1.2 with the following parameters:
    • param-file “File to process”: MAGeCK test Gene Summary
    • “AWK Program”: Copy and paste the text in the grey box below into this field
    # Print new header for first line
    NR == 1 { print "gene", "pval", "fdr", "lfc" }
    
    # Only process lines after first
    NR > 1 {
        # check if neg pval < pos pval
        if ($4 < $10){
           # if it is, print negative selection values
            print $1, $4, $5, $8
        } else {
           # if it's not, print positive selection values
            print $1, $10, $11, $14
        }
    }
    
  2. Inspect the file output. It should look like below.
    gene    pval        fdr       lfc
    ESD     1.7977e-05  0.279703  -1.9404
    MTHFD1L 2.7827e-05  0.279703  -0.90207
    SHMT2   8.9391e-05  0.454208  -1.1307
    FLI1    9.0376e-05  0.454208  -0.085192
    
  3. Volcano Plot Tool: toolshed.g2.bx.psu.edu/repos/iuc/volcanoplot/volcanoplot/0.0.5 to create a volcano plot
    • param-file “Specify an input file”: the Text reformatting output file
    • param-file “File has header?”: Yes
    • param-select “FDR (adjusted P value)”: Column 3
    • param-select “P value (raw)”: Column 2
    • param-select “Log Fold Change”: Column 4
    • param-select “Labels”: Column 1
    • param-select “Points to label”: Significant
      • param-text “Only label top most significant”: 10
  4. Inspect the plot in the PDF output.

    question Questions

    What is the most significant gene?

    solution Solution

    ATP5E as it is the gene nearest the top of the plot.

    Volcano plot.

For more details on using the volcano plot tool, see the tutorial here. For how to customise the volcano plot tool output using R, see the tutorial here.

tip Tip: Getting help

For questions about using Galaxy, you can ask in the Galaxy help forum. For questions about MAGeCK, you can ask in the MAGeCK Google group.

Conclusion

CRISPR Screen reads can be assessed for quality using standard sequencing tools such as FASTQC, MultiQC and trimmed of adapters using Cutadapt. The detection of enriched guides can be performed using MAGeCK.

Key points

  • CRISPR screen data can be analysed using MAGeCK and standard read quality tools

Frequently Asked Questions

Have questions about this tutorial? Check out the FAQ page for the Genome Editing topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

References

  1. Marcel, M., 2011 Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 10–12. http://journal.embnet.org/index.php/embnetjournal/article/view/200
  2. Li, W., H. Xu, T. Xiao, L. Cong, M. I. Love et al., 2014 MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. 15: 10.1186/s13059-014-0554-4
  3. Li, W., J. Köster, H. Xu, C.-H. Chen, T. Xiao et al., 2015 Quality control, modeling, and visualization of CRISPR screens with MAGeCK-VISPR. 16: 10.1186/s13059-015-0843-6
  4. Doench, J. G., N. Fusi, M. Sullender, M. Hegde, E. W. Vaimberg et al., 2016 Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology 34: 184–191. 10.1038/nbt.3437
  5. Ewels, P., M. Magnusson, S. Lundin, and M. Käller, 2016 MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32: 3047–3048. https://academic.oup.com/bioinformatics/article/32/19/3047/2196507
  6. Joung, J., S. Konermann, J. S. Gootenberg, O. O. Abudayyeh, R. J. Platt et al., 2016 Protocol: Genome-scale CRISPR-Cas9 Knockout and Transcriptional Activation Screening. 10.1101/059626
  7. Fujihara, K. M., B. Zhang, T. D. Jackson, B. Nijiagel, C.-S. Ang et al., 2020 Genome-wide CRISPR screens reveal APR-246 (Eprenetapopt) triggers ferroptosis and inhibits iron-sulfur cluster biogenesis. 10.1101/2020.11.29.398867

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Maria Doyle, Kenji Fujihara, Twishi Gulati, 2021 CRISPR screen analysis (Galaxy Training Materials). https://training.galaxyproject.org/archive/2021-12-01/topics/genome-editing/tutorials/crispr-screen/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

details BibTeX

@misc{genome-editing-crispr-screen,
author = "Maria Doyle and Kenji Fujihara and Twishi Gulati",
title = "CRISPR screen analysis (Galaxy Training Materials)",
year = "2021",
month = "11",
day = "22"
url = "\url{https://training.galaxyproject.org/archive/2021-12-01/topics/genome-editing/tutorials/crispr-screen/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                

Congratulations on successfully completing this tutorial!