The first indicator of the quality of your sequencing data is the per base sequence quality of your raw reads. Often you will see the quality decrease with increasing base position, just as in the FastQC image below (Fig. 1). But what is the reason for this, and what are the consequences?
First of all, don’t panic! It is a normal and well-known phenomenon. The reason for the decreasing sequence quality lies in Illumina's sequencing technology.
Illumina relies on the so-called sequencing by synthesis procedure. During each cycle of the process, the sequencer washes chemicals that include variants of all four nucleotides over the flow cell (which holds many clusters, each consisting of identical DNA fragments). The nucleotides carry a blocker (terminator cap) so that only one base is added to each DNA molecule per cycle. After the coupled fluorescence signal has been detected, the blocker is removed and the next cycle can start. This way, the DNA fragments in each cluster are sequenced synchronously, emitting a specific fluorescence signal in each cycle.
Different errors can occur during the sequencing process. The main reason for the decreasing sequence quality is the so-called phasing. Phasing means that the blocker of a nucleotide is not correctly removed after signal detection. In the next cycle, no new nucleotide can bind to this DNA fragment, so the old nucleotide is detected one more time, and its fluorescence signal (probably) differs from the synchronous signal of the other fragments in the cluster (Fig. 2). From now on this DNA fragment will be one cycle behind the rest (out of phase), polluting the light signal that the sequencer's camera has to read. A similar effect occurs if a nucleotide has a defective terminator cap (prephasing). In this case two nucleotides can bind in one cycle, and the fragment will be one cycle ahead of the rest.
These errors occur with a low probability. But over time (with increasing read length) they add up and pollute the light signal more and more, so the signal of each cluster becomes increasingly asynchronous. Since the light signal is used to calculate the quality scores, the asynchronous signal results in decreasing sequence quality scores.
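To get a feeling for how small per-cycle errors add up, here is a minimal sketch in Python. It is a deliberately simplified model, not how a sequencer computes anything: it assumes every molecule slips independently in each cycle, that a molecule which has slipped once stays out of phase, and that the phasing and prephasing rates (0.2% each here) are purely hypothetical values chosen for illustration.

```python
def in_phase_fraction(cycles, phasing_rate=0.002, prephasing_rate=0.002):
    """Fraction of a cluster's molecules still in sync after each cycle.

    Simplified model: a molecule that lags (phasing) or runs ahead
    (prephasing) once is counted as out of phase for the rest of the run.
    The default rates are hypothetical, not measured values.
    """
    stay_in_sync = 1.0 - phasing_rate - prephasing_rate
    # After n cycles a molecule is in phase only if it never slipped.
    return [stay_in_sync ** n for n in range(1, cycles + 1)]


fractions = in_phase_fraction(150)
for cycle in (1, 50, 100, 150):
    print(f"cycle {cycle:3d}: {fractions[cycle - 1]:.3f} of molecules in phase")
```

Even with a low error rate per cycle, the in-phase fraction shrinks geometrically in this model, which matches the intuition that the signal gets noisier towards the end of the read.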
As we now know, the decreasing per base sequence quality is due to an unwanted but unavoidable process. It limits the length of high-quality reads. Newer sequencing chemistries are largely intended to minimize the phasing problem, increasing the read length that can be reached before quality begins to decrease.
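If you want to look at the per base sequence quality discussed above without a plotting tool, the following sketch computes the mean quality per read position. It assumes four-line FASTQ records (no wrapped sequences) and the common Phred+33 encoding, in which a score Q corresponds to an error probability of 10^(-Q/10). The function names and the two example reads are made up for illustration.

```python
from statistics import mean


def quality_strings(fastq_lines):
    """Yield the quality line (every fourth line) of each FASTQ record."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:
            yield line.strip()


def per_base_mean_quality(qualities, offset=33):
    """Return the mean Phred score at each read position (Phred+33 assumed)."""
    qualities = list(qualities)
    max_len = max((len(q) for q in qualities), default=0)
    means = []
    for pos in range(max_len):
        # Shorter reads simply do not contribute to later positions.
        scores = [ord(q[pos]) - offset for q in qualities if len(q) > pos]
        means.append(mean(scores))
    return means


def phred_to_error_probability(score):
    """Convert a Phred score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical records; real data would be read from a FASTQ file instead.
example_fastq = [
    "@read1", "ACGTACGT", "+", "IIIIFFF#",
    "@read2", "ACGTAC", "+", "IIIFF#",
]
for position, q in enumerate(per_base_mean_quality(quality_strings(example_fastq)), start=1):
    print(f"position {position}: mean Q = {q:.1f}, error probability = {phred_to_error_probability(q):.4f}")
```

For real data sets a dedicated tool such as FastQC is the better choice, since it also reports the distribution of scores at each position rather than only the mean.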
Last updated on January 20, 2017
ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We organize public workshops and conduct on-site trainings on NGS data analysis.