name: inverse layout: true class: center, middle, inverse
--- # Plates, Batches, and Barcodes --- ## Requirements Before diving into this slide deck, we recommend you to have a look at: - [Introduction to Galaxy Analyses](/archive/2019-11-01/topics/introduction) - [Sequence analysis](/archive/2019-11-01/topics/sequence-analysis) - Quality Control: [
slides
slides](/archive/2019-11-01/topics/sequence-analysis/tutorials/quality-control/slides.html) - [
tutorial
hands-on](/archive/2019-11-01/topics/sequence-analysis/tutorials/quality-control/tutorial.html) - Mapping: [
slides
slides](/archive/2019-11-01/topics/sequence-analysis/tutorials/mapping/slides.html) - [
tutorial
hands-on](/archive/2019-11-01/topics/sequence-analysis/tutorials/mapping/tutorial.html) .footnote[Tip: press `P` to view the presenter notes] ??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off --- ### <i class="fa fa-question-circle" aria-hidden="true"></i><span class="visually-hidden">question</span> Questions - What are barcodes and how are they applied to batches? - What is the difference between a plate and a batch? - Are batches and lanes the same? - Why are plating setups important to know? - Is it necessary to check barcodes that don't exist in a batch? --- ### <i class="fa fa-bullseye" aria-hidden="true"></i><span class="visually-hidden">objectives</span> Objectives - How to describe a plating setup - Proper naming conventions when dealing with multiple batches - Checking for false positives across batches --- ### Sorting Plates .image-100[] .left[Plates are *N x M* arrays of wells that cells are sorted, to then be individually amplified and sequenced.] --- ### Sorting Plates .image-100[] .left[Not all wells need to be occupied by cells, some wells are left empty.] .center[Why is this the case?] --- ### Sorting Plates .image-100[] .center[Library preparation is not always perfect] .center[Sample **contamination** from neighbouring wells] --- ### Plates and Lanes .image-50[] .center[Sorting Plates are divided into *lanes*:] .image-50[] ??? Plates are $$N \times M$$ arrays or wells that cells are sorted into and then individually amplified and sequenced. The way these slot are filled depends entirely on the protocol, but usually not all slots are filled. The reason for this will become clear momentarily. * Lanes demarcate evenly-sized *n x m* rectangular regions on a plate (where *n < N and m < M*). * Multiple lanes can exist on plate, lanes ideally do not overlap within a plate. --- ### Plates and Lanes .image-100[] * All slots within a lane are sequenced at the same time. * All lanes within a plate *may or may not* be sequenced at the same time. * Lanes can therefore be thought of as different *batches*. --- ### Lanes and Batches .image-100[] ??? * Lanes and batches describe the same spaces, but not the same samples. * For this reason, batch numbers should *not* persist across different plates. --- ### Distinguishing cells in a plate .image-100[] * Cells are selected from a plate by their *barcodes* * Barcodes must be unique + e.g. *4 x 18 = 72* slots in the plate, need 72 unique cell barcodes ??? The way these slots are filled depends entirely on the protocol, but usually not all slots are filled. The reason for this will become clear momentarily --- ### What are Cell Barcodes? .image-100[] * Barcodes are added to *all* transcripts of a *specific* cell * Transcripts with different cell barcodes originate from different cells ??? --- ### What are Cell Barcodes? .image-100[] * Many different cell barcodes are used across many different cells * Each well in a plate contains a cell, indexed by its cell barcode --- ### Questions about Cell Barcodes .image-75[] .left[Assuming a lane is a 20x5 array:] 1. How many cell barcodes are needed for a single lane? 1. How many cell barcodes are needed for a plate with 10 lanes? 1. What would be the minimum length of the barcodes for each of the previous questions? -- .footnote[ 1. *20 x 5 = 100 unique barcodes per lane* 1. *20 x 5 x 10 = 1,000 unique barcodes per plate* ] --- ### Questions about Cell Barcodes .left[<small>3) What would be the minimum length of the barcodes for each of the previous questions?</small>] .pull-left[ A single lane? * 100 barcodes, 4 nucleotides for each base of a barcode | Barcodes | Result | |---------|------| | $$4^2 = 16$$ | <small>No, 2 bases is not enough | | $$4^3 = 64$$ | <small>No, 3 bases is not enough | | $$4^4 = 256$$ | <small>Yes, 4 bases is enough to cover 100 barcodes (and more!) | ] -- .pull-right[ A plate with 10 lanes? * 1000 barcodes, 4 nucleotides for each base of a barcode | Barcodes | Result | |---------|------| |$$4^4 = 256$$ | <small>No, 4 bases is not enough | |$$4^5 = 1024$$ | <small>Yes, 5 bases is enough to cover 1,000 barcodes (just barely!) | ] --- ### Barcode Safeguarding <!-- TODO: * Find a way to get <style>.MathJax_Display { display: inline; }</style> working so that MathJax can render inline * The MathJax selector seems to apply directly to the element, which is impossible to override! --> .pull-left[ * Is 5 nucleotides really enough to capture 1000 cells? * What could go wrong if all barcodes are separated by 1 bp, as shown? ] .pull-right[ <small>*Edit distance = 1bp* AAAAA AAAAC AAAAG AAAAT AAACA AAAGA AAATA ···· CCCCC CCCCA CCCCG CCCCT CCCAC CCCGC CCCTC ···· · · · ] ??? If we assume that every barcode is separated from every other barcode by 1 bp, then the answer is 'yes' That's right, sequencing errors can mislabel a read to a completely different cell than from where the transcript originated from. --- ### Guarding against Sequencing Errors .pull-left[ * Sequencing errors can change the cell barcode * To guard against a 1 bp sequencing error, need to space barcodes by 2 bp * How many barcodes for a cell barcode of length 5 bp (ED = 2)? ] .pull-right[ <small>*Edit Distance = 2 bp* AAAAA AAACC AAAGG AAATT AACCA AAGGA AATTA ···· CCCCC CCCAA CCCGG CCCTT CCCAA CCGGC CCTTC ···· · · · ] -- $$ 4^{5-1} = 512 $$ --- ### Edit Distance : General Principle .pull-left[ e.g. For barcodes of length N=3: * Edit Distance of E=1: .center[`AAA AAC AAG AAT ACA ACC` ...] .center[<small>64 barcodes</small>] * Edit Distance of E=2: .center[`AAA ACC AGG ATT CAA CCC` ...] .center[<small>16 barcodes</small>] * Edit Distance of E=3: .center[`AAA CCC GGG TTT`] .center[<small>4 barcodes</small>] ] -- .pull-right[ Number of Barcodes : $$ 4^{N-(E-1)} $$ .center[For barcodes of length *N*, and Edit Distance of *E*.] ] --- ### How many available barcodes are there? .pull-left[ * Barcodes typically limited to 4 main bases {A,C,T,G}. * Number of available barcodes depends on: 1. The length of the barcode 2. The edit distance between adjacent barcodes * Must take sequencing errors into account ] -- .pull-right[ * Availability is balance between *barcode size* vs. *sequencing errors* * Typically between 4-10 bases, range of cells that can be labelled: $$[4^4, 4^{10}] = [256, 1048576]$$ * True range is lower due to edit distance. ] ??? * The availability of barcodes and the number of cells they can label is a design choice, where the technician must balance two opposing forces: barcode size vs sequencing errors. * For the first, this means that for a barcode $$N$$ bases long, there will $$4^N$$ barcodes available. Typically barcodes tend to span 4-10 bases ($$4^4 = 256$$ to $$4^{10} = 1048576$$), since longer barcodes tend to be more subjectable to sequencing errors. * The true number of barcodes used is actually smaller than $$4^N$$ due to the measures used to space barcodes apart from one another to reduce sequencing errors. --- ### Design Factors in Barcodes Is it better to use Longer or Shorter Barcodes? | Longer | Shorter | |--------|---------| | Offer larger range of barcodes | Offer smaller range of barcodes | | Prone to more sequencing errors | More robust against sequencing errors | | Must increase edit distance significantly | A smaller edit distance is more acceptable | | Can accommodate large edit distances | *Cannot* accommodate large edit distances | --- ### Cell Barcodes: Summary .pull-left[ * A single barcode sequence indexes a single cell * Every transcript in a specific cell has the same cell barcode * Barcodes are *designed* and are not random oligonucleotides * Barcode use is *limited* by length and plate size ] -- <hline> .pull-right[#### Need to balance: <small> | | | | |-|-|-| | Edit Distance | vs. | Barcode Length | | Number of Barcodes | vs. | Plate and Lane Size | </small> ] -- .pull-right[* How?] --- ### Intelligent Plating Strategies * The main contending questions are: 1. How large is each batch? ??? How many slots in a batch that we need to barcode for -- 2. How many batches on a plate? -- 3. Should each batch use the same barcodes? -- 4. What **constraints** are there on the plate? ??? A technician always has to balance quality against cost, and this is illustrated in the following examples: --- ### Example Setup .pull-left[ * Barcodes: * 24 unique barcodes, with an edit distance of E=2: AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] -- .pull-right[ * Plates and Lanes: * 12 slots per lane (3x4) * 4 lanes per plate * Constraints * Only 2 lanes sequenced at the same time * i.e. half the plate is sequenced at the same time ] ??? Here we use N as an extra base just for example purposes, but you do sometimes see this in other barcodes. 2 lanes at a time, only half --- ### Example 1: Single Plate with a Single Lane .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] -- .image-75[] .pull-left[ * 12 total slots in Plate 1 * 12 slots in Lane 1 * All slots filled ] -- .pull-right[ * We only need to use half of our barcodes * Why is this wasteful? ] ??? * Half of the barcodes used for that lane, and the other half we can ignore. * Wasteful because we are not getting full use of our 24 barcodes in a single sequencing run --- ### Example 2: Single Plate with 2 Lanes .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] * 24 total slots in Plate 1 * 12 slots in Lane 1, and 12 slots in Lane 2 * All slots filled .image-75[] -- * We can use all of our barcodes * Maximum number of cells can be sequenced in a single run * Why might this be too optimal? ??? Here we use all barcodes since these lanes will be sequenced at the same time Let's look at one final example to see why using all our barcodes on a plate might not be optimal. --- ### Example 3: Single Plate with 2 Lanes, only 1 active .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] * 24 total slots in Plate 1 * 12 slots in Lane 1, and 12 slots in Lane 2 * Only slots in Lane 1 filled .image-75[] -- * Why are we still using all barcodes? * Why have we not filled in Lane 2? ??? All barcodes used, why leave one lane empty? --- ### Example 3: Single Plate with 2 Lanes, only 1 active .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] * What would it mean if we sequenced reads with {TTT,TAA,...,TNN} as their Cell Barcodes? .image-75[] -- * **Cross-Contamination** * There should be *no cells in that lane*! * These are contaminants from Lane 1 * We can ignore these reads and select for the {AAA,ACC,...,GTT} ??? If we see any reads in the Plate which contain barcodes {TTT,TAA,TCC, etc} then we know that some contamination has occurred *because there should be no cells there*. One reason is that the second lane was not completely cleaned before being used. --- ### Example 4: 2 Plates with 2 Lanes, only 1 active (alternate) .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] * Same as before but extra plate. A single lane used to check for contaminants. * Alternate barcodes used for each plate .image-50[] <small> * Why alternate active lanes across plates? </small> -- <small> * Plate 1 and Plate 2 are sometimes the *same plate* * Contaminants might carry over *undetected* if same lane is used! </small> ??? * Here we have repeated previous example, but with an extra plate. In the first plate, the first half of the barcodes are used, and in the second plate, the second half of the barcodes are used. * Why alternate the barcodes between plates? The full set of barcodes does not change, so why not keep the same format? * Loaded at different times, washed clean, re-used. * Again, the answer is to reduce cross-contamination. Plate2 will be loaded after Plate1 (and perhaps Plate2 and Plate1 will use the same plate!) If we see any reads in Plate2 that should not be there, we can now surmise where they came from. We also have the added benefit of protecting the cells in Plate2 from those that may have been used in Plate1, since they are in completely different positions across plates. --- ### Example 5: 1 Plate with 2 lanes, both active .bottom-info-box[ Available Barcodes AAA ACC AGG TTT TAA TCC ATT CCC CAA TGG NAA NCC CGG CTT GGG NGG NTT ANN GAA GCC GTT CNN GNN TNN ] * Both lanes filled. *All barcodes* are applied individually to *each* lane .image-100[] .pull-left[ * Why check all barcodes against each lane? * Why not separate lanes across different plates? ] -- .pull-right[ * If <small>{TTT,TAA,...,TNN}</small> is detected in Lane 1, or vice versa → **Contamination!** * *Benefit of* **detecting cross-contamination** *whilst still* **maximising plate usage** ] ??? 1. Why apply the full set of barcodes to each lane, when only half will actually label? 2. What benefit does this serve, instead of separating them over different plates as in the previous example? A1. This setup is actually the same as example 4, but with the two plates merged. Here we can check for cross-contamination in each lane by measuring the real cell labels against the false barcodes. If in lane 1, we detect a significant number of reads with cell barcodes of `TAA` or `ANN`, we can assume that some cross-contamination has occurred since we should not be able to detect these barcodes in that lane. The converse is also true of lane 2. A2. We have the benefit of detecting cross-contamination with the same advantages as example 4, but with the cost advantage of sequencing two batches at the same time. --- # Summary .pull-left[ * Cell Barcodes are sequences attached to transcripts to indicate what cell a transcript came from * Cell Barcodes are *designed* to the Plate/Lane setup ] -- .pull-right[ * Indicate what cell a transcript came from. * Reduce sequencing errors * Check for cross contamination ] --- ### <i class="fa fa-key" aria-hidden="true"></i><span class="visually-hidden">keypoints</span> Key points - Plates are split into Lanes - Batches are Lanes, but names are not reused - Multiple batches can exist in many Plates - Checking all barcodes across Batches eliminates false positives / cross-contamination - Intelligent plating strategies can guard against cross-contamination --- ### <i class="fa fa-graduation-cap" aria-hidden="true"></i><span class="visually-hidden">curriculum</span> Do you want to extend your knowledge? Follow one of our recommended follow-up trainings: - [Transcriptomics](/archive/2019-11-01/topics/transcriptomics) - Pre-processing of Single-Cell RNA Data: [<i class="fa fa-laptop" aria-hidden="true"></i><span class="visually-hidden">tutorial</span> hands-on](/archive/2019-11-01/topics/transcriptomics/tutorials/scrna-preprocessing/tutorial.html) --- ## Thank you! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://wiki.galaxyproject.org/Teach/GTN) and all the contributors!
Mehmet Tekman
,
Alex Ostrovsky