Short Barcodes for Next-Generation Sequencing
Posted by:
Thomas Hanselmann
Published on:
Jul 19, 2018
This blog post reviews a must-read paper for anyone involved in next-generation sequencing (NGS) – “Short Barcodes for Next Generation Sequencing” by Mir et al. (2013) – and highlights the key concepts to consider when designing barcodes for NGS applications.
This review:
- Outlines what is meant by barcodes and explains how they are used in NGS
- Provides a clear overview to help readers without a mathematical background gain a better and deeper understanding of the paper
- Explains why interdisciplinary knowledge and collaboration are crucial for a better understanding of problems and solutions
Successful solution
The paper, “Short Barcodes for Next Generation Sequencing” by Katharina Mir, Klaus Neuhaus, Martin Bossert, and Steffen Schober, is an excellent illustration of how to develop a successful solution to a commercially interesting challenge in the DNA and RNA sequencing landscape.
Analysis of DNA from several patients in one flow cell
Modern high-throughput sequencers operate by reading out the sequence of the nucleotides A, C, G, or T (or U in the case of RNA) from single-stranded DNA or RNA fragments of up to 300 nucleotides, depending on the manufacturer and technology. In sequencing-by-synthesis systems from Illumina or QIAGEN, this sequential process is massively parallelized in a flow cell, where millions of DNA fragments are processed at once.
Bioinformatics algorithms then reassemble the decoded fragments to reconstruct the original genome. The chemistry for a typical sequencing run currently costs around $1000, which is almost independent of the number of DNA fragments sequenced. As the throughput of sequencers has increased drastically in the last decade, it is now even possible to analyze DNA from several patients in one sequencing run. But surely no doctor wants to diagnose a patient if there is a risk that the DNA samples from several patients were mixed up. So, the question is, how can DNA fragments from several patients be safely mixed into one flow cell?
Smart, elegant solution
The answer is very elegant and involves barcoding each DNA fragment in a pre-processing step (part of library preparation). If each DNA fragment from a patient is labeled, i.e. extended at its beginning or end with a barcode that is unique to that patient, the sequencing process will read out the barcode and then the actual fragment of interest, or vice versa. The whole genome for each patient can then be assembled using bioinformatics algorithms, because every fragment is identified by the barcode of the patient it came from.
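The idea can be sketched in a few lines of Python. This is only an illustration: the 4-nucleotide barcodes, the patient labels, and the function name `demultiplex` are made up, the barcode is assumed to sit at the start of each read, and only exact barcode matches are accepted (no error correction yet).

```python
from collections import defaultdict

# Hypothetical 4-nt barcodes, one per patient (illustrative values only).
BARCODES = {"ACGT": "patient_1", "TGCA": "patient_2"}
BARCODE_LEN = 4


def demultiplex(reads):
    """Group reads by the barcode at their start and strip the barcode."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        # Exact matching only: reads with an unknown barcode are set aside.
        bins[BARCODES.get(barcode, "undetermined")].append(insert)
    return dict(bins)


reads = ["ACGTGGATTC", "TGCATTAGGC", "ACGTCCCTAA", "AAGTCCCTAA"]
result = demultiplex(reads)
print(result)
```

The last read carries the barcode AAGT, which is one substitution away from ACGT. With exact matching it ends up as "undetermined", which is precisely the situation that error-tolerant barcodes are meant to handle.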
If a dozen patient samples are processed per sequencer run, the cost per patient drops by a factor equal to the number of patients. This puts it in the realm of around $100 per patient, which is comparable with other standard laboratory tests and could soon become accepted practice in certain diagnostic applications. What is the extra cost? The paper by Mir et al. mentions that an extra preparation step is necessary, namely adding a barcode to every DNA fragment. The barcode also lowers the effective throughput of DNA material, because it takes up part of each read. Assume, for argument’s sake, that the average DNA fragment is 150 base pairs long. A barcode should then not add much more than a dozen nucleotides, otherwise the effective throughput drops too much due to patient multiplexing. Plus, synthesizing longer barcodes is more expensive, which is what makes short barcodes so interesting.
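A quick back-of-the-envelope check of these numbers, using the figures quoted above. It assumes that the barcode and the fragment are read in the same read, so the barcode's share of the read is lost to the actual DNA of interest; the variable names are only illustrative.

```python
RUN_COST_USD = 1000    # chemistry cost per run, as quoted in the text
PATIENTS_PER_RUN = 12  # "a dozen" multiplexed samples
FRAGMENT_LEN = 150     # assumed average fragment length in nucleotides
BARCODE_LEN = 12       # "a dozen" additional nucleotides

# Cost is shared equally between the multiplexed patients.
cost_per_patient = RUN_COST_USD / PATIENTS_PER_RUN

# Fraction of every sequenced nucleotide that is spent on the barcode.
barcode_overhead = BARCODE_LEN / (FRAGMENT_LEN + BARCODE_LEN)

print(f"cost per patient: ${cost_per_patient:.2f}")
print(f"barcode overhead: {barcode_overhead:.1%}")
```

Under these assumptions the chemistry costs roughly $83 per patient and the barcode consumes about 7% of the sequenced nucleotides.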
So let’s turn to the paper now
The introduction to the paper outlines the kinds of errors that can occur in the sequencing process, i.e. base-calling errors (errors in reading out the DNA sequence). The read-out is not error-free, which means that the read-out of the barcodes is not error-free either. So, how can mix-ups of patient DNA be prevented? The answer is by using special barcodes that can help detect or even correct errors. The clever way of doing this is the topic of the paper.
This review has already provided a rough description of the multiplexing method using barcodes and why this is highly attractive from an application point of view. The paper goes into much more depth regarding the technical details of how multiplexing and demultiplexing are performed. It then turns to the system model for using barcodes in the context of DNA sequencing, thereby combining knowledge of biological and communication systems. This is an excellent example of a collaboration between authors from two different fields, communication engineering and microbial ecology.
Simplified explanation of complicated concepts for non-specialists
The paper tackles the complicated concept of a communication channel. Codewords (block codes of length n) encode messages (here the patient number m) and are transmitted over a channel (the model of the sequencing process). The received word (r) is then passed to the decoder, which should safely recover the original message, i.e. produce an estimated patient number equal to m, despite potential transmission (or channel) errors (read-out errors in the biological system). The paper explains these things so well that even specialists without a background in communications will understand them easily (see Figure 2 of the paper).
Average and maximum decoding errors are introduced based on Shannon’s communication theory, which states that the average decoding error can be made arbitrarily small if the codewords are long enough, provided the code rate stays below the channel capacity. In other words, safe demultiplexing of patient DNA can be guaranteed with codewords that are long enough for the number of patients to be distinguished.
A (bar)code is a set of codewords, and its cardinality is the number of different messages it can encode (in NGS barcoding, the number of patients). As currently only a dozen or so patients have been multiplexed, the paper often uses codes with a cardinality of 48, as the combination of low cardinality and smart encoding can help to achieve a higher tolerance to channel errors.
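To get a feeling for the quantities involved, here is a minimal sketch that compares the rate of a code with the capacity of a channel. The channel is an assumption made for illustration, not necessarily the paper's model: a quaternary symmetric channel in which each nucleotide is misread with probability p and the three wrong nucleotides are equally likely. The barcode length of 8 in the example is hypothetical as well.

```python
import math


def _plog2(x):
    """x * log2(x), with the usual convention 0 * log2(0) = 0."""
    return 0.0 if x == 0 else x * math.log2(x)


def capacity_quaternary_symmetric(p):
    """Capacity in bits per nucleotide of the assumed 4-ary symmetric channel."""
    # Entropy of the output given the input: probabilities (1-p, p/3, p/3, p/3).
    conditional_entropy = -_plog2(1 - p) - 3 * _plog2(p / 3)
    return 2.0 - conditional_entropy


def code_rate(num_codewords, length):
    """Information carried per nucleotide by a code with this many codewords."""
    return math.log2(num_codewords) / length


capacity = capacity_quaternary_symmetric(0.01)  # assumed 1% substitution rate
rate = code_rate(48, 8)                         # 48 codewords, hypothetical length 8
print(f"capacity: {capacity:.3f} bits/nt, rate: {rate:.3f} bits/nt")
```

Shannon's statement is asymptotic, so a rate below capacity does not by itself say how well one particular short code performs. It does show why adding a few nucleotides to a barcode buys a lot of error tolerance.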
Balance between model complexity and easy code design
The simplifications made in the channel model (the system model) are crucial. It assumes that substitution errors, which are the most relevant errors in practice, are independent and that no deletion or insertion errors occur, which is reasonable for short codes. As an aside, if deletion and insertion errors do occur, the code may no longer be able to correct errors but can still detect them, which makes this a good balance between model complexity and easy code design.
The decoder is based on a simple maximum a posteriori (MAP) decoder, which, given that DNA sample sizes are equal (after the PCR amplification process), equates to a maximum likelihood decoder. This leads to a simple decoding rule based on the Hamming distance, i.e. the number of positions at which the received word r (provided by the base caller) differs from a codeword: the decoder picks the codeword in the code that is closest to r.
Special attention is paid to linear (block) codes, as they are well studied and have the nice property that their distance distribution coincides with their weight distribution.
Having gone through the necessary background, it is easy to understand what good codewords are, i.e. what a good barcode is. The paper provides an algorithm for constructing such barcodes that also takes special design constraints into account, such as keeping homopolymers (repeats of the same nucleotide) short or controlling the GC content.
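Such a channel is easy to simulate. The sketch below is an illustration under stated assumptions: every nucleotide is substituted independently with probability p, the three wrong nucleotides are taken to be equally likely (an assumption that goes beyond the text), and insertions and deletions never happen, so the length of the word is preserved. The function name is made up.

```python
import random


def substitution_channel(seq, p, rng):
    """Substitute each nucleotide independently with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            # Replace by one of the three other nucleotides, chosen uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
sent = "ACGTACGTACGT"
received = substitution_channel(sent, 0.1, rng)
print(sent, "->", received)
```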
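A minimum-distance decoder of this kind fits in a few lines. The toy code below (four repetition codewords of length 5) is hypothetical and only serves the example. Returning `None` when two codewords are equally close is a design choice of this sketch, not something taken from the paper.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(received, code):
    """Return the index of the closest codeword, or None if there is a tie."""
    distances = [hamming_distance(received, codeword) for codeword in code]
    best = min(distances)
    if distances.count(best) > 1:
        return None  # ambiguous: the error is detected but not corrected
    return distances.index(best)


# Hypothetical toy code: index = patient number, any two codewords differ in 5 positions.
CODE = ["AAAAA", "CCCCC", "GGGGG", "TTTTT"]

print(decode("AACAA", CODE))  # one substitution: still decoded as patient 0
print(decode("AACCG", CODE))  # equally close to two codewords: None
```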
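This property can be checked numerically on a small example. The sketch below builds a tiny linear code over the four-element field GF(4) and compares the distance distribution seen from every codeword with the weight distribution. The generator matrix `G` and the mapping of field elements to nucleotides (0, 1, 2, 3 to A, C, G, T) are arbitrary choices for illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import product

# GF(4) with elements 0, 1, 2, 3: addition is bitwise XOR,
# multiplication is given by this table.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]

# Hypothetical generator matrix: 2 message symbols -> 4 code symbols.
G = [(1, 0, 1, 1), (0, 1, 1, 2)]


def encode(message):
    """Codeword = GF(4)-linear combination of the rows of G."""
    word = [0] * len(G[0])
    for coeff, row in zip(message, G):
        for i, g in enumerate(row):
            word[i] ^= GF4_MUL[coeff][g]
    return tuple(word)


def distance(a, b):
    return sum(x != y for x, y in zip(a, b))


code = [encode(m) for m in product(range(4), repeat=len(G))]
zero = encode((0,) * len(G))

# Weight = distance to the all-zero codeword.
weight_distribution = Counter(distance(c, zero) for c in code)
# Distances from each codeword to all codewords of the code.
distance_distributions = [Counter(distance(c, o) for o in code) for c in code]
same = all(d == weight_distribution for d in distance_distributions)

print("".join("ACGT"[x] for x in code[5]), dict(weight_distribution), same)
```

Because the difference of two codewords is itself a codeword, the code looks the same from every codeword. That is why it is enough to study the weights when analysing a linear code.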
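To illustrate what such a construction has to juggle, here is a simple greedy search. It is explicitly not the algorithm from the paper, just a generic sketch: it walks through all words of a given length, discards those that violate the GC-content or homopolymer constraints, and keeps a word only if it is far enough (in Hamming distance) from all words kept so far. All parameter values and function names are illustrative assumptions.

```python
from itertools import product


def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical nucleotides."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def greedy_barcodes(length, min_distance, gc_range=(0.4, 0.6), max_run=2):
    """Greedily collect words that satisfy all constraints."""
    chosen = []
    for letters in product("ACGT", repeat=length):
        candidate = "".join(letters)
        if not gc_range[0] <= gc_content(candidate) <= gc_range[1]:
            continue
        if max_homopolymer(candidate) > max_run:
            continue
        if all(hamming_distance(candidate, c) >= min_distance for c in chosen):
            chosen.append(candidate)
    return chosen


barcodes = greedy_barcodes(length=6, min_distance=3)
print(len(barcodes), barcodes[:5])
```

With a minimum distance of 3, any single substitution in a barcode can be corrected by a minimum-distance decoder. A greedy search like this gives no guarantee of finding the largest possible code, which is where a more careful construction pays off.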
Well thought out modeling for barcode design
For statisticians, there is no true model but only better and worse models. However, this paper has come up with some of the best modelling I have seen in the area of barcode design for NGS. This is also confirmed in the “Results” section, where a comparison with real published barcodes, including barcodes available as kits for sequencing, clearly shows that the scheme put forward can improve on them. In particular, these results may be interesting for barcodes in non-invasive prenatal testing (NIPT) applications, where errors in the barcode region are more prone to contribute to cross-talk and, thus, are more critical for the quality of the diagnosis.
Conclusions
This is a real gem of a paper and provides interesting insights into the use of short barcodes in NGS. NGS is a rapidly evolving area and it will surely not be long before the commercial benefits of short barcodes are exploited. In future, NGS kits could embrace this technology to maximize time-savings and reduce costs.
People need to become more aware of the value that STEM knowledge can bring to design. I’m sure readers will also recognize the benefits that the paper addresses. For me, it was a great joy to read it and I would like to congratulate the authors on their excellent work.
Outlook and further potential
HSE•AG can help kit manufacturers to implement these kinds of barcodes. We have expertise in modelling of complex systems in life sciences thanks to our highly experienced, cross-disciplinary team of engineers and scientists, many of whom have a Ph.D. We can build system models with the appropriate complexity to obtain innovative solutions. It is also important to make sure not to reinvent the wheel, but to keep the bigger picture in mind and recognize how to use innovative solutions that are already available. These can be incorporated into larger systems while keeping system design manageable.