Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) analysis is a method to decipher the chromatin states of the analyzed cells. In general, genes are only expressed in accessible (i.e. âopenâ) chromatin and not in closed chromatin.
By analyzing which genomic sites have an open chromatin state, cell-type specific patterns of gene accessibility can be determined.
scATAC-seq is particularly useful for analyzing tissue containing different cell populations, such as peripheral blood mononuclear cells (PBMCs).
In this tutorial we will analyze scATAC-seq data using the tool suites SnapATAC2 (Zhang et al. 2024) and Scanpy (Wolf et al. 2018).
With both of these tool suites we will perform preprocessing, clustering and identification of scATAC-seq datasets from 10x Genomics.
The analysis will be performed using a dataset of PBMCs containing ~4,620 single nuclei.
Did you know we have a unique Single Cell Omics Lab with all our single cell tools highlighted to make it easier to use on Galaxy? We recommend this site for all your single cell analysis needs, particularly for newer users.
The Single Cell Omics Lab is a different view of the underlying Galaxy server that organises tools and resources better for single-cell users! It also provides a platform for communities to engage and connect; distribute more targeted news and events; and highlight community-specific funding sources.
Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step⊠this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.
Open your Galaxy server
Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
Navigate to your tutorial
Tool names in tutorials will be blue buttons that open the correct tool for you
Note: this does not work for all tutorials (yet)
You can click anywhere in the grey-ed out area outside of the tutorial box to return back to the Galaxy analytical interface
Warning: Not all browsers work!
Weâve had some issues with Tutorial mode on Safari for Mac users.
Try a different browser if you arenât seeing the button.
ATAC-seq utilizes a hyperactive Tn5 transposase (Kia et al. 2017) to ligate adaptors to genome fragments, created by the transposase. Performing ATAC-seq on individual cells used to be an expensive and time-consuming labour.
The 10X Chromium NextGEM system made scATAC-seq a cost-effective method for gaining high-resolution data with a simple protocol.
After the transposition of nuclei in bulk, individual nuclei are put into Gel beads in Emulsion (GEM), containing unique 10x cell barcodes and sequencing adaptors for Illumina sequencing.
Figure 1: An overview of the 10X single-nuclei ATAC-seq library preparation
Data
The 5k PBMCs dataset for this tutorial is available for free from 10X Genomics. The blood samples were collected from a healthy donor and were prepared following the Chromium Next GEM scATAC-seq protocol. After sequencing on Illumina NovaSeq, the reads were processed by the Cell Ranger ATAC 2.0.0 pipeline from 10X to generate a Fragments File.
The Fragments File is a tabular file in a BED-like format, containing information about the position of the fragments on the chromosome and their corresponding 10x cell barcodes.
SnapATAC2 requires 3 input files for the standard pathway of processing:
fragments_file.tsv: A tabular file containing the chromosome positions of the fragments and their corresponding 10x cell barcodes.
chrom_sizes.txt: A tabular file of the number of bases of each chromosome of the reference genome
gene_annotation.gtf.gz: A tabular file listing genomic features of the reference genome (GENCODE GRCh38)
A chromosome sizes file can be generated using the tool Compute sequence length ( Galaxy version 1.0.3).
The reference genome can either be selected from cached genomes or uploaded to the galaxy history.
Comment
This tutorial starts with a fragment file.
SnapATAC2 also accepts mapped reads in a BAM file.
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Convert a BAM file to a fragment file, using 'pp.make_fragment_file'
param-fileâFile name of the BAM fileâ: BAM_500-PBMC (Input dataset)
param-toggleâIndicate whether the BAM file contain paired-end readsâ: Yes
âHow to extract barcodes from BAM records?â: From read names using regular expressions
âExtract barcodes from read names of BAM records using regular expressionsâ: (................):
Comment
Not every regular expression type is supported.
This expression selects 16 characters if they are followed by a colon. Only the cell barcodes of the BAM file will match.
Rename the generated file to Fragments 500 PBMC
Now you can continue with either the fragments_file from earlier or the new file Fragments 500 PBMC.
galaxy-info The tool pp.make_fragment_filetool has implemented additional quality control (QC) measures. These filter out larger fragments, which will be noticeable in the log-scale fragment size distribution.
galaxy-info Please note that Fragments 500 PBMC only contains 500 PBMCs and thus the clustering will produce different outputs compared to the outputs generated by fragments_file (with 5k PBMC).
Preprocessing
Preprocessing of the scATAC-seq data contained in the fragment file with SnapATAC2 begins with importing the files and computing basic QC metrics.
SnapATAC2 compresses and stores the fragments into an AnnData object.
AnnData
The AnnData format was initially developed for the Scanpy package and is now a widely accepted data format to
store annotated data matrices in a space-efficient manner.
Import files to SnapATAC2
Hands On: Create an AnnData object
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Import data fragment files and compute basic QC metrics, using 'pp.import_data'
param-fileâFragment file, optionally compressed with gzip or zstdâ: fragments_file.tsv (Input dataset)
param-toggleâWhether the fragment file has been sorted by cell barcodesâ: No
This tool requires the fragment file to be sorted according to cell barcodes.
If pp.make_fragment_filetool was used to generate the fragment file, this has automatically been done.
Otherwise, the setting âsorted by cell barcodesâ should remain No.
Rename the generated file to Anndata 5k PBMC
Check that the format is h5ad
Because the AnnData format is an extension of the HDF5 format, i.e. a binary format, an AnnData object can not be inspected directly in Galaxy by clicking on the galaxy-eye (View data) icon.
Instead, we need to use a dedicated tool from the AnnData suite.
Hands On: Inspect an AnnData object
Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
param-fileâAnnotated data matrixâ: Anndata 5k PBMC
âWhat to inspect?â: General information about the object
How many observations are there? What do they represent?
How many variables are there?
There are 14232 observations, representing the cells.
There are 0 variables, representing genomic regions. This is because genome-wide 500-bp bins are only added after initial filtering.
Many toolsets producing outputs in AnnData formats in Galaxy, provide the general information by default:
Click on the name of the dataset in the history to expand it.
General Anndata information will be given in the expanded box:
e.g.
[n_obs x n_vars]
- 14232 Ă 0
details This feature isnât the most reliable and might display:
[n_obs x n_vars]
- 1 Ă 1
In such cases and for more specific queries, Inspect AnnData ( Galaxy version 0.10.3+galaxy0) is required.
Calculate and visualize QC metrics
Hands On: Fragment-size distribution
SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for plottingâ: Plot fragment size distribution, using 'pl.frag_size_distr'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC (output of pp.import_datatool)
galaxy-eye Inspect the .png output
SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for plottingâ: Plot fragment size distribution, using 'pl.frag_size_distr'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC (output of pp.import_datatool)
âChange the y-axis (fragment counts) to log scaleâ: Yes
galaxy-eye Inspect the .png output
Question
What distinct features do the plots have? And what do they represent?
Which fragments are generally from open chromatin?
3 peaks are clearly visible (at <100-bp, ~200-bp and ~400-bp). The smallest fragments are from nucleosome-free regions, while the larger peaks of 200- and 400-bp contain mono- and di-nucleosome fragments, respectively.
The small fragments (<100-bp) are from open chromatin reads, since the Tn5 transposase could easily access the loosely packed DNA (Yan et al. 2020).
The transcription start site enrichment (TSSe) is another important QC metric. Nucleosome-free fragments are expected to be enriched at transcription start sites (TSS). TSSe shows increased fragmentation of chromatin around the TSS. This suggests open and accessible nucleosome-free chromatin.
TSSe is used as a QC metric, since an increased enrichment around TSS regions suggests that the experiment has captured biological meaningful genomic features.
TSSe scores of individual cells can be calculated using SnapATAC2âs metrics.tsse function.
Hands On: Calculate and Plot TSSe
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Compute the TSS enrichment score (TSSe) for each cell, using 'metrics.tsse'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC (output of pp.import_datatool)
param-fileâGTF/GFF file containing the gene annotationâ: gene_annotation.gtf.gz (Input dataset)
Rename the generated file to Anndata 5k PBMC TSSe or add the tag galaxy-tagsTSSe to the dataset:
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
Click on the dataset to expand it
Click on Add Tagsgalaxy-tags
Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
Press Enter
Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):
a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
dataset 3 is used to calculate read coverage using BedTools Genome Coverageseparately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain âplusâ data versus âminusâ data. For example, does dataset 10 contain âplusâ data or âminusâ data? Probably âminusâ but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on⊠As a result it is straightforward to trace both branches (plus and minus) of this analysis.
SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for plottingâ: Plot the TSS enrichment vs. number of fragments density figure, using 'pl.tsse'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC TSSe (output of metrics.tssetool)
galaxy-eye Inspect the .png output
High-quality cells can be identified in the plot of TSSe scores against a number of unique fragments for each cell.
Question
Where are high-quality cells located in the plot?
Based on this plot, how should the filter be set?
The cells in the upper right are high-quality cells, enriched for TSS. Fragments in the lower left represent low-quality cells or empty droplets and should be filtered out.
Setting the minimum number of counts at 5,000 and the minimum TSS enrichment to 10.0 is an adequate filter.
Filtering
Based on the TSSe plot the cells can be filtered by TSSe and fragment counts.
Hands On: Filter cells
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Filter cell outliers based on counts and numbers of genes expressed, using 'pp.filter_cells'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC TSSe (output of metrics.tssetool)
âMinimum number of counts required for a cell to pass filteringâ: 5000
âMinimum TSS enrichemnt score required for a cell to pass filteringâ: 10.0
âMaximum number of counts required for a cell to pass filteringâ: 100000
Rename the generated file to Anndata 5k PBMC TSSe filtered or add the tag galaxy-tagsfiltered to the dataset
galaxy-eye Inspect the general information of the .h5ad output
How have the observations changed, compared to the first Anndata 5k PBMC AnnData file?
What does this tell us about the quality of the data?
There are only 4564 observations, compared to the initial 14232 observations.
And the obs: 'tsse' has been added (but already during metrics.tssetool)
The empty droplets and low-quality cells have been filtered out, leaving us with 4564 high-quality cells.
Feature selection
Currently, our AnnData matrix does not contain any variables. The variables will be added in the following step with the function pp.add_tile_matrix.
This creates a cell-by-bin matrix containing insertion counts across genome-wide 500-bp bins.
After creating the variables, the most accessible features are selected.
Hands On: Select features
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Generate cell by bin count matrix, using 'pp.add_tile_matrix'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC TSSe filtered (output of pp.filter_cellstool)
âThe size of consecutive genomic regions used to record the countsâ: 500
âThe strategy to compute feature countsâ: "insertion": based on the number of insertions that overlap with a region of interest
SnapATAC2 Preprocessing ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for preprocessingâ: Perform feature selection, using 'pp.select_features'
param-fileâAnnotated data matrixâ: Anndata tile_matrix (output of pp.add_tile_matrixtool)
âNumber of features to keepâ: 250000
Comment: Select features
Including more features improves resolution and can reveal finer details, but it may also introduce noise.
To optimize results, experiment with the n_features parameter to find the most appropriate value for your dataset.
To demonstrate the differences when selecting features, the following UMAP plots are the outputs from processing with a number of features between 1,000 and 500,000.
Fewer features result in fewer, but larger clusters. Selecting a lot of features will output more granular clusters and the compute time will increase.
How did n_vars change compared to Anndata 5k PBMC TSSe filtered
Where are the selected features stored in the count matrix?
There are 6,062,095 variables, compared to 0 in TSSe filtered.
Selected features are stored in var: 'selected'.
Doublet removal
Doublets are removed by calling a customized scrublet (Wolock et al. 2019) algorithm. pp.scrublet will identify potential doublets and the function pp.filter_doublets removes them.
During single-cell sequencing, multiple nuclei can be encapsulated into the same 10x gel bead. The resulting multiplets (>97% of which are doublets) produce sequences from both cells.
Doublets can confound the results by appearing as ânewâ clusters or artifactual intermediary cell states.
These problematic doublets are called neotypic doublets, since they appear as ânewâ cell types.
Scrublet (Single-cell Remover of Doublets) is an algorithm which can detect neotypic doublets that produce false results.
The algorithm first simulates doublets by combining random pairs of observed cell features.
The observed features of the âcellsâ are then compared to the simulated doublets and scored on their doublet probability.
SnapATAC2âs pp.filter_doublets then removes all cells with a doublet probability >50%.
Cell doublets were removed. n_obs was reduced from 4564 to 4430 cells.
The outputs of pp.scrublet are stored in observations obs: 'doublet_probability', 'doublet_score' and in unstructured annotations uns: 'scrublet_sim_doublet_score'. The output of pp.filter_doublets is stored in uns: 'doublet_rate'.
Dimension reduction
Dimension reduction is a very important step during the analysis of single cell data. During this, the complex multi-dimensional data is projected into lower-dimensional space, while the lower-dimensional embedding of the complex data retains as much information as possible. Dimension reduction enables batch correction, data visualization and quicker downstream analysis since the data is more simplified and the memory usage is reduced (Zhang et al. 2024).
Dimension reduction algorithms can be either linear or non-linear.
Linear methods are generally computationally efficient and well scalable.
A popular linear dimension reduction algorithm is:
PCA (Principle Component Analysis), implemented in Scanpy (please check out our Scanpy tutorial for an explanation).
Nonlinear methods however are well suited for multimodal and complex datasets.
in contrast to linear methods, which often preserve global structures, non-linear methods have a locality-preserving character.
This makes non-linear methods relatively insensitive to outliers and noise while emphasizing natural clusters in the data (Belkin and Niyogi 2003)
As such, they are implemented in many algorithms to visualize the data in 2 dimensions (f.ex. UMAP embedding).
The nonlinear dimension reduction algorithm, through matrix-free spectral embedding, used in SnapATAC2 is a very fast and memory efficient non-linear algorithm (Zhang et al. 2024).
Spectral embedding utilizes an iterative algorithm to calculate the spectrum (eigenvalues and eigenvectors) of a matrix without computing the matrix itself.
The outputs of tl.spectral are stored in unstructured annotations uns: 'spectral_eigenvalue' and as multidimensional observations obsm: 'X_spectral'.
UMAP embedding
With the already reduced dimensionality of the data stored in X_spectral, the cells can be further embedded (i.e. transformed into lower dimensions) with Uniform Manifold Approximation and Projection (UMAP).
UMAP projects the cells and their relationship to each other into 2-dimensional space, which can be easily visualized (McInnes et al. 2018).
Hands On: UMAP embedding
SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âDimension reduction and Clusteringâ: Compute Umap, using 'tl.umap'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC spectral (output of tl.spectraltool)
âUse the indicated representation in â.obsmââ: X_spectral
Rename the generated file to Anndata 5k PBMC UMAP or add the tag galaxy-tagsUMAP to the dataset
Clustering
During clustering, cells that share similar accessibility profiles are organized into clusters. SnapATAC2 utilizes graph-based community clustering with the Leiden algorithm (Traag et al. 2019).
This method takes the k-nearest neighbor (KNN) graph as input data and produces well-connected communities.
Community clustering
Hands On: Clustering analysis
SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âDimension reduction and Clusteringâ: Compute a neighborhood graph of observations, using 'pp.knn'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC UMAP (output of tl.umaptool)
SnapATAC2 Clustering ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âDimension reduction and Clusteringâ: Cluster cells into subgroups, using 'tl.leiden'
param-fileâAnnotated data matrixâ: Anndata knn (output of pp.knntool)
âWhether to use the Constant Potts Model (CPM) or modularityâ: modularity
Comment: CPM or modularity
make sure you selected modularity
the clusters produced by CPM are not represented well in the UMAP projections
Rename the generated file to Anndata 5k PBMC leiden or add the tag galaxy-tagsleiden to the dataset
galaxy-eye Inspect the general information of the .h5ad output
Where are the leiden clusters stored in the AnnData?
The clusters are stored in obs: 'leiden'
Plotting of clusters
Now that we have produced UMAP embeddings of our cells and have organized the cells into leiden clusters, we can now visualize this information with pl.umap.
Hands On: Plotting the clusters
SnapATAC2 Plotting ( Galaxy version 2.6.4+galaxy1) with the following parameters:
âMethod used for plottingâ: Plot the UMAP embedding, using 'pl.umap'
param-fileâAnnotated data matrixâ: Anndata 5k PBMC leiden (output of tl.leidentool)
âColorâ: leiden
âHeight of the plotâ: 500
galaxy-eye Inspect the .png output
Question
How many leiden clusters were discovered?
What does the distance of clusters to each other tell us about their chromatin states?
There are 13 leiden clusters.
Clusters in close proximity (f.ex. clusters 0 and 5) share a similar chromatin accessibility profile (and very likely also a similar cell type), compared to a cluster further away (f.ex. cluster 9).
Cell cluster annotation
After clustering the cells, they must be annotated. This categorizes the clusters into known cell types. Manual Cell Annotation requires known marker genes and varying expression profiles of the marker genes among clusters.
What does n_vars represent in Anndata 5k PBMC gene_matrix and what did it represent in Anndata 5k PBMC leiden?
The variables now represent accessible genes. There are 60606 accessible genes in our samples. In Anndata 5k PBMC leiden and all earlier AnnData the variables represented the 6062095 fixed-sized genomic bins.
Imputation with Scanpy and MAGIC
Similar to scRNA-seq data, the cell-by-gene-activity matrix is very sparse. Additionally, high gene variance between cells, due to technical confounders, could impact the downstream analysis.
In scRNA-seq, filtering and normalization are therefore required to produce a high-quality gene matrix.
Since the cell-by-gene-activity matrix resembles the cell-by-gene-expression matrix of scRNA-seq, we can use the tools of the Scanpy (Wolf et al. 2018) tool suite to continue with our data.
The count matrices of single-cell data are sparse and noisy.
Confounding issues, such as âdropoutâ effects, where some mRNA or DNA-segments are not detected although they are present in the cell, also result in some cells missing important cell-type defining features.
These problems can obscure the data, as only the strongest gene-gene relationships are still detectable.
The Markov Affinity-based Graph Imputation of Cells (MAGIC) algorithm (van Dijk et al. 2018) tries to solve these issues by filling in missing data from some cells with transcript information from similar cells.
The algorithm calculates the likely gene expression of a single cell, based on similar cells and fills in the missing data to produce the expected expression.
MAGIC achieves this by building a graph from the data and using data diffusion to smooth out the noise.
How did n_vars change, compared to Anndata 5k PBMC gene_matrix?
Which data was âlostâ, compared to Anndata 5k PBMC leiden, and must be added to the file in order to produce UMAP plots?
The number of genes was reduced from 60606 to 55106 by the filtering. Additional annotations were added, such as: obs: 'n_genes', 'n_counts', var: 'n_cells', 'n_counts' and uns: 'log1p'.
The UMAP embeddings obsm: 'X_umap' are missing and should be added to the Anndata in the next step. Without X_umap it wonât be possible to visualize the plots.
Copy-over embeddings
Hands On: Copy UMAP embedding
AnnData Operations ( Galaxy version 1.9.3+galaxy0) with the following parameters:
param-fileâInput object in hdf5 AnnData formatâ: Anndata 5k PBMC magic (output of external.pp.magictool)
âCopy embeddings (such as UMAP, tSNE)â: Yes
âKeys from embeddings to copyâ: X_umap
param-fileâIAnnData objects with embeddings to copyâ: Anndata 5k PBMC leiden
Comment: Annotations to copy
This tutorial only focuses on producing an UMAP plot with marker-genes.
If further analysis, with tools requiring more annotations, is intended, these can be added in a similar way as shown above.
f.ex. Peak and Motif Analysis with Snapatac2 peaks and motif ( Galaxy version 2.6.4+galaxy1) requires annotations from uns.
It is also possible to leave the input âKeys from embeddings to copyâ empty, to copy all annotations of a given category such as obsm.
Rename the generated file to Anndata 5k PBMC magic UMAP or add the tag galaxy-tagsUMAP to the dataset
galaxy-eye Inspect the general information of the .h5ad output, to check if obsm contains X_umap
The gene activity of selected marker genes can now be visualized with Scanpy.
Hands On: Plot marker genes
Plot with Scanpy ( Galaxy version 1.9.6+galaxy3) with the following parameters:
param-fileâAnnotated data matrixâ: output_h5ad (output of AnnData Operationstool)
âMethod used for plottingâ: Embeddings: Scatter plot in UMAP basis, using 'pl.umap'
âKeys for annotations of observations/cells or variables/genesâ: leiden, CD3D, CD8A, CD4, MS4A1, NKG7, CD14, FCER1A
param-toggleâShow edges?â: No
In âPlot attributesâ
âNumber of panels per rowâ: 2
Question
Are the marker genes selectively expressed in the clusters?
Which marker genes have overlapping expression profiles? And what could that imply?
Some marker genes, such as MS4A1 or CD8A, are only expressed in a few clusters (clusters 6+11 and clusters 1+8, respectively).
The marker gene CD3D is expressed in multiple clusters (1, 2, 4, 7 and 8). Overlapping expression profiles imply similar cell types since similar cell types have similar marker genes upregulated. In this case, CD3D expression classifies the cells in these clusters as T-cells.
Manual cluster annotation
Comparison of marker gene expression in our clusters with a table of canonical marker genes (Sun et al. 2019), enables us to annotate the clusters manually.
Cell type
Marker genes
CD8+ T cells
CD3D+, CD8A+, CD4-
CD4+ T cells
CD3D+, CD8A-, CD4+
B cells
CD3D-, MS4A1+
Natural killer (NK) cells
CD3D-, NKG7+
Monocytes
CD3D-, CD14+
Dendritic cells
CD3D-, FCER1A+
These canonical marker genes can match the clusters to known cell types:
Cluster
Cell type
0
Monocytes
1
CD8+ T cells
2
CD4+ T cells
3
Monocytes
4
CD4+ T cells
5
Monocytes
6
B cells
7
CD4+ T cells
8
CD8+ T cells
9
NK cells
10
Monocytes
11
B cells
12
Dendritic cells
Comment
Note that some clusters contain subtypes (f.ex. the annotated B cell clusters contain both naive and memory B cells). The cell-type annotation can be refined by choosing more specific marker genes.
To manually annotate the Leiden clusters, we will need to perform multiple steps:
Inspect the key-indexed observations of Anndata 5k PBMC gene_matrix magic UMAP
Cut the Leiden annotations out of the table
Upload a replace file containing the new cell type annotations for the Leiden clusters
Replace the values of the cluster annotation with cell type annotation
Add the cell type annotation to the AnnData
Plot the annotated cell types
Hands On: Manual annotation
Inspect AnnData ( Galaxy version 0.10.3+galaxy0) with the following parameters:
param-fileâAnnotated data matrixâ: Anndata 5k PBMC gene_matrix magic UMAP
âWhat to inspect?â: Key-indexed observations annotation
Rename the generated file to 5k PBMC observations
galaxy-eye Inspect the generated file
Question
In which column is the Leiden annotation located?
The Leiden annotation is in column 8.
Column 1
Column 2
Column 3
Column 4
Column 5
Column 6
Column 7
Column 8
Column 9
Column 10
ââ
n_fragment
frac_dup
frac_mito
tsse
doublet_probability
doublet_score
leiden
n_genes
n_counts
AAACGAAAGACGTCAG-1
22070
0.5219425551271499
0.0
30.43315066436454
0.004634433324822066
0.009276437847866418
8
52303
16521.599844068267
AAACGAAAGATTGACA-1
10500
0.5345125681606597
0.0
29.10551296093465
0.004668403569267374
0.001088139281828074
1
54501
15020.42495602328
AAACGAAAGGGTCCCT-1
19201
0.5101785714285714
0.0
19.90011850347046
0.004634433324822066
0.009276437847866418
5
54212
16294.751533305309
AAACGAACAATTGTGC-1
13242
0.487399837417257
0.0
29.060913705583758
0.004660125753854076
0.0022172949002217295
7
53530
15456.629863655084
Cut columns ( Galaxy version 1.0.2) with the following parameters:
param-selectâCut columnsâ: c8
param-fileâFromâ: 5k PBMC observations (output of Inspect AnnDatatool)
Yes. B-cells are far away from NK and T cells. Only the myeloid lineage of monocytes and dendritic cells are located close to each other. This might be due to the common progenitor cell lineage and thus a similar chromatin profile.
Yes. The most common cell types are T cells, followed by Monocytes and B cells. NK cells and Dendritic cells only make up a small percentage of PBMCs.
Conclusion
congratulations Well done, youâve made it to the end! You might want to consult your results with this control history, or check out the full workflow for this tutorial.
In this tutorial, we produced a count matrix of scATAC-seq reads in the AnnData format and performed:
Preprocessing:
Plotting the fragment-size distributions
Calculating and plotting TSSe scores
Filtering cells and selecting features (fixed-size genomic bins)
Dimension reduction through Spectral embedding and UMAP
Clustering of cells via the Leiden method
Cluster annotation
Producing and filtering a cell by gene activity matrix
Data normalization and imputation with Scanpy and the MAGIC algorithm
Visualizing marker genes in the clusters
Manually annotating the cell types with selected marker genes
The scATAC-seq analysis can now continue with downstream analysis, for example differential peak analysis.
The SnapATAC2 tools for differential peak analysis are already accessible on Galaxy. However, there are no GTN trainings available yet. Until such a tutorial is uploaded, you can visit the SnapATAC2 documentation for a tutorial on differential peak analysis. And check out our example history and the exemplary workflow.
The tools are available in Galaxy under SnapATAC2 Peaks and Motif ( Galaxy version 2.6.4+galaxy1).
If you want to continue with differential peak analysis, please make sure that the AnnData object with the annotated cell types contains unspecified annotations for the reference sequences (uns: 'reference_sequences').
The section Copy-over embeddings explains how to copy annotations from one AnnData object to another.
You've Finished the Tutorial
Please also consider filling out the Feedback Form as well!
Key points
Single-cell ATAC-seq can identify open chromatin sites
Dimension reduction is required to simplify the data while preserving important information about the relationships of cells to each other.
Clusters of similar cells can be annotated to their respective cell-types
Frequently Asked Questions
Have questions about this tutorial? Have a look at the available FAQ pages and support channels
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
Belkin, M., and P. Niyogi, 2003 Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15: 1373â1396. 10.1162/089976603321780317
Kia, A., C. Gloeckner, T. Osothprarop, N. Gormley, E. Bomati et al., 2017 Improved genome sequencing using an engineered transposase. BMC Biotechnology 17: 10.1186/s12896-016-0326-1
Wolf, F. A., P. Angerer, and F. J. Theis, 2018 SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19: 10.1186/s13059-017-1382-0
Dijk, D. van, R. Sharma, J. Nainys, K. Yim, P. Kathail et al., 2018 Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 174: 716â729.e27. 10.1016/j.cell.2018.05.061
Amemiya, H. M., A. Kundaje, and A. P. Boyle, 2019 The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific Reports 9: 10.1038/s41598-019-45839-z
Sun, Z., L. Chen, H. Xin, Y. Jiang, Q. Huang et al., 2019 A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nature Communications 10: 10.1038/s41467-019-09639-3
Traag, V. A., L. Waltman, and N. J. van Eck, 2019 From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9: 10.1038/s41598-019-41695-z
Wolock, S. L., R. Lopez, and A. M. Klein, 2019 Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems 8: 281â291.e9. 10.1016/j.cels.2018.11.005
Yan, F., D. R. Powell, D. J. Curtis, and N. C. Wong, 2020 From reads to insight: a hitchhikerâs guide to ATAC-seq data analysis. Genome Biology 21: 10.1186/s13059-020-1929-3
Zhang, K., N. R. Zemke, E. J. Armand, and B. Ren, 2024 A fast, scalable and versatile tool for analysis of single-cell omics data. Nature Methods 21: 217â227. 10.1038/s41592-023-02139-9
Glossary
PBMCs
peripheral blood mononuclear cells
QC
quality control
TSS
transcription start sites
TSSe
transcription start site enrichment
UMAP
Uniform Manifold Approximation and Projection
scATAC-seq
Single-cell Assay for Transposase-Accessible Chromatin using sequencing
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{single-cell-scatac-standard-processing-snapatac2,
author = "Timon Schlegel",
title = "Single-cell ATAC-seq standard processing with SnapATAC2 (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scatac-standard-processing-snapatac2/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn GrĂŒning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Congratulations on successfully completing this tutorial!
Do you want to extend your knowledge?
Follow one of our recommended follow-up trainings: