Short Barcodes for Next-Generation Sequencing

Posted by:

Thomas Hanselmann

Published on:

Jul 19, 2018

This blog post reviews a must-read paper for anyone involved in next-generation sequencing (NGS) – “Short Barcodes for Next Generation Sequencing” by Mir et al. (2013) – and highlights the key concepts to consider when designing barcodes for NGS applications.

This review:

- Outlines what is meant by barcodes and explains how they are used in NGS
- Provides a clear overview to help readers without a mathematical background gain a better and deeper understanding of the paper
- Explains why interdisciplinary knowledge and collaboration are crucial for a better understanding of problems and solutions

Successful solution

The paper, “Short Barcodes for Next Generation Sequencing” by Katharina Mir, Klaus Neuhaus, Martin Bossert, and Steffen Schober, illustrates in an excellent way how to develop a successful solution to a commercially interesting challenge in the DNA and RNA sequencing landscape.

Analysis of DNA from several patients in one flow cell

Modern high-throughput sequencers read out the sequence of the nucleotides A, C, G, or T (or U in the case of RNA) from single-stranded DNA or RNA fragments of up to 300 nucleotides, depending on the manufacturer and technology. In sequencing-by-synthesis systems from Illumina or QIAGEN, this sequential process is massively parallelized in a flow cell, where millions of DNA fragments are processed at once.

Bioinformatics algorithms then reassemble the decoded fragments to reconstruct the original genome. The chemistry for a typical sequencing run currently costs around $1000, which is almost independent of the number of DNA fragments sequenced. As the throughput of sequencers has increased drastically in the last decade, it is now even possible to analyze DNA from several patients in one sequencing run. But surely no doctor wants to diagnose a patient if there is a risk that the DNA samples from several patients were mixed up. So, the question is: how can DNA fragments from several patients be safely mixed into one flow cell?

Smart, elegant solution

The answer is very elegant and involves barcoding each DNA fragment in a pre-processing step (part of library preparation). If each DNA fragment from a patient is labeled, or extended with a unique barcode at the beginning or end of the fragment, the sequencing process will read out the barcode and then the actual fragment of interest, or vice-versa. The whole genome for each patient can now be assembled using bioinformatics algorithms because every fragment is identified by a unique barcode for that patient.
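The demultiplexing idea described above can be sketched in a few lines of Python. This is only an illustration: the barcode sequences, the barcode length of 4, the patient labels, and the function name `demultiplex` are made up for this example, and the barcode is assumed to sit at the beginning of each read and to be read without errors.

```python
from collections import defaultdict

# Hypothetical barcode-to-patient table; the sequences are invented for illustration.
BARCODES = {
    "ACGT": "patient_1",
    "TGCA": "patient_2",
    "GATC": "patient_3",
}
BARCODE_LENGTH = 4


def demultiplex(reads):
    """Group reads by patient using an exact match on the leading barcode."""
    by_patient = defaultdict(list)
    for read in reads:
        # Split each read into its barcode prefix and the fragment of interest.
        barcode, fragment = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        patient = BARCODES.get(barcode, "unassigned")
        by_patient[patient].append(fragment)
    return dict(by_patient)


reads = ["ACGTTTGACC", "TGCAGGATTA", "ACGTCCATGA", "AAAATTTTCC"]
groups = demultiplex(reads)
```

With exact matching, a single read-out error in the barcode sends a read to "unassigned" or, worse, to the wrong patient, which is the problem the paper sets out to solve.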

If a dozen patient samples are processed per sequencing run, the cost per patient drops by roughly a factor equal to the number of patients. This puts it in the realm of around $100 per patient, which is comparable with other standard laboratory tests and could soon become accepted practice in certain diagnostic applications. What is the extra cost? The paper by Mir et al. mentions that an extra preparation step is necessary, namely adding a barcode to every DNA fragment. This lowers the effective throughput of DNA material, because every nucleotide spent on a barcode is one not spent on the fragment of interest. If, for argument’s sake, the average DNA fragment is 150 base pairs long, a barcode should not add much more than a dozen nucleotides; otherwise the effective throughput drops too much. Plus, synthesizing longer barcodes is more expensive, which is what makes short barcodes so interesting.
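The back-of-the-envelope numbers above can be reproduced with a short sketch. The function names are hypothetical, and the overhead calculation assumes that the barcode is sequenced in addition to the 150-nucleotide fragment; neither is taken from the paper.

```python
def cost_per_patient(run_cost, patients):
    """Chemistry cost per patient when one run is shared equally."""
    return run_cost / patients


def barcode_overhead(fragment_length, barcode_length):
    """Fraction of all sequenced nucleotides that are spent on the barcode."""
    return barcode_length / (fragment_length + barcode_length)


# Figures quoted in the text: a $1000 run, a dozen patients,
# 150-nucleotide fragments and a barcode of about a dozen nucleotides.
cost = cost_per_patient(1000, 12)
overhead = barcode_overhead(150, 12)
print(f"cost per patient: ${cost:.2f}")
print(f"barcode overhead: {overhead:.1%}")
```

Under these assumptions a 12-nucleotide barcode costs well under a tenth of the sequenced bases, while doubling the barcode length would roughly double that share.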

So let’s turn to the paper now

The introduction to the paper outlines the kinds of errors that can occur in the sequencing process, in particular basecalling errors (errors in the read-out of the DNA sequence). Since the read-out is not error-free, neither is the read-out of the barcodes. So, how can mix-ups of patient DNA be prevented? The answer is by using special barcodes that can help detect or even correct errors. The clever way of doing this is the topic of the paper.

This review has already provided a rough description of the multiplexing method using barcodes and why this is highly attractive from an application point of view. The paper goes into much more depth regarding the technical details of how multiplexing and demultiplexing are performed. It then turns to the system model for using barcodes in the context of DNA sequencing, thereby combining knowledge of biological and communication systems. This is an excellent example of a collaboration between authors who come from the different fields of communication engineering and microbial ecology.

Simplified explanation of complicated concepts for non-specialists

The paper tackles the complicated concept of a communication channel: codewords (block codes of length n) encode messages (here, the patient number m); the codewords are transmitted over a channel (the model of the sequencing process); and the received word (r) is passed to the decoder, which should recover the same message, i.e. an estimated patient number equal to m, despite potential transmission (or channel) errors (read-out errors in the biological system). The paper explains these things so well that even specialists without a background in communications will understand them easily (see Figure 2 of the paper).

Average and maximum decoding errors are introduced based on Shannon’s communication theory, which states that the average decoding error can be made arbitrarily small if the codewords are long enough, provided the code rate stays below the channel capacity. In other words, safe demultiplexing of patient DNA is possible with sufficiently long codewords, as long as the code does not try to carry more information per nucleotide than the channel allows.

A (bar)code is a set of codewords, and its cardinality is the number of different messages it can encode (in the reality of an NGS barcode, the number of patients). As currently only a dozen or so patients have been multiplexed, the paper often uses codes with a cardinality of 48, since the combination of low cardinality and smart encoding helps to achieve higher tolerance to channel errors.
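To make the link between cardinality and barcode length concrete, the following sketch computes the shortest barcode that could label a given number of patients if no error protection were needed at all. The function name is hypothetical and the calculation is a simple counting argument, not something taken from the paper.

```python
def min_barcode_length(patients, alphabet_size=4):
    """Smallest n such that alphabet_size**n distinct words cover all patients."""
    length = 0
    capacity = 1  # number of distinct words of the current length
    while capacity < patients:
        capacity *= alphabet_size
        length += 1
    return length


# With the four nucleotides A, C, G and T, 3 positions already give 4**3 = 64 words.
shortest = min_barcode_length(48)
print(shortest)
```

Every nucleotide beyond this bare minimum is redundancy, and it is this redundancy that a well-designed code turns into the ability to detect or correct read-out errors.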

Balance between model complexity and easy code design

The simplifications made in the channel model (= system model) are crucial. It assumes that substitution errors, which are the more relevant errors in practice, occur independently, and that no deletion or insertion errors occur, which is reasonable for short codes. As an aside, if deletion and insertion errors do occur, the code may no longer be able to correct errors but can still detect them, which makes the model a good balance between model complexity and easy code design.

The decoder is a simple maximum a posteriori (MAP) decoder which, given that DNA sample sizes are equal (after the PCR amplification process), is equivalent to a maximum likelihood decoder. This leads to a simple decoding rule based on the Hamming distance, i.e. the number of positions in which the received word r (provided by the base caller) differs from a codeword: r is decoded to the closest codeword in the code.
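The decoding rule described above can be sketched as a minimum Hamming distance decoder. The four codewords below are a toy code invented for this example (any two of them differ in all four positions); they are not barcodes from the paper, and returning None on a tie is a design choice of this sketch.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(received, code):
    """Return the codeword closest to the received word, or None on a tie."""
    ranked = sorted((hamming_distance(received, c), c) for c in code)
    if len(ranked) > 1 and ranked[0][0] == ranked[1][0]:
        return None  # ambiguous: the error is detected but cannot be corrected
    return ranked[0][1]


# Toy code for illustration only.
CODE = ["ACGT", "CATG", "GTAC", "TGCA"]

corrected = decode("ACGA", CODE)   # one substitution away from "ACGT"
ambiguous = decode("ACCA", CODE)   # two substitutions away from two codewords
```

A single substitution is corrected because the received word is still closest to the original codeword, whereas two substitutions can land exactly between two codewords, in which case the error can only be detected.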

Special attention is paid to linear (block) codes, as they are well studied and have the nice property that their distance distribution coincides with their weight distribution.

Having gone through the necessary background, it is easy to understand what good codewords are, i.e. what a good barcode is. The paper provides an algorithm for constructing such barcodes under special design constraints, such as restrictions on homopolymers (repetitions of the same nucleotide) or on the GC content.
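To give a feeling for what such a construction involves, here is a simple greedy sketch. It is not the algorithm from the paper: it merely enumerates all words of a given length, keeps those that satisfy the design constraints, and adds a word to the code only if it is far enough (in Hamming distance) from every codeword chosen so far. The function names and the thresholds (homopolymer runs of at most 2, GC content between 40% and 60%) are assumptions made for this example.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def gc_content(word):
    """Fraction of G and C nucleotides in the word."""
    return sum(base in "GC" for base in word) / len(word)


def longest_homopolymer(word):
    """Length of the longest run of identical consecutive nucleotides."""
    longest = run = 1
    for previous, current in zip(word, word[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


def is_admissible(word, max_run=2, gc_min=0.4, gc_max=0.6):
    """Check the (assumed) design constraints for a single candidate word."""
    return (longest_homopolymer(word) <= max_run
            and gc_min <= gc_content(word) <= gc_max)


def greedy_code(length, min_distance):
    """Greedily collect admissible words with pairwise distance >= min_distance."""
    code = []
    for letters in product("ACGT", repeat=length):
        word = "".join(letters)
        if is_admissible(word) and all(
            hamming_distance(word, c) >= min_distance for c in code
        ):
            code.append(word)
    return code


barcodes = greedy_code(length=6, min_distance=3)
print(len(barcodes), barcodes[:5])
```

A greedy search like this is easy to follow but generally does not find the largest possible code, which is one reason a more careful construction is worthwhile.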

Well thought out modeling for barcode design

For statisticians, there is no true model but only better and worse models. However, this paper has come up with some of the best modelling I have seen in the area of barcode design for NGS. This is also confirmed in the “Results” section where real published barcodes, including barcodes available as kits for sequencing, clearly show that there is room for improvement with the scheme put forward. In particular, these results may be interesting for barcodes in NIPT (non-invasive prenatal tests) applications, where errors in the barcode region are more prone to contribute to cross-talk and, thus, are more critical for the quality of the diagnosis.

Conclusions

This is a real gem of a paper and provides interesting insights into the use of short barcodes in NGS. NGS is a rapidly evolving area and it will surely not be long before the commercial benefits of short barcodes are exploited. In future, NGS kits could embrace this technology to maximize time-savings and reduce costs.

People need to become more aware of the value that STEM knowledge can bring to design. I’m sure readers will also recognize the benefits that the paper addresses. For me, it was a great joy to read it and I would like to congratulate the authors on their excellent work.

Outlook and further potential

HSE•AG can help kit manufacturers to implement these kinds of barcodes. We have expertise in modelling of complex systems in life sciences thanks to our highly experienced, cross-disciplinary team of engineers and scientists, many of whom have a Ph.D. We can build system models with the appropriate complexity to obtain innovative solutions. It is also important to make sure not to reinvent the wheel, but to keep the bigger picture in mind and recognize how to use innovative solutions that are already available. These can be incorporated into larger systems while keeping system design manageable.

Thomas Hanselmann
