Sunday, 2 February 2014

One of our bases is missing: where's the G in NextSeq chemistry

Illumina surprised pretty much everyone with their latest SBS chemistry. I was at an Illumina meeting last week where it was discussed at some length, with many in the audience concerned about possible new error modes due to the drastic change compared to the four-colour SBS used in HiSeq and MiSeq.

I posted an explanation of how the two-colour chemistry works a couple of weeks ago and my initial thoughts remain the same: the chemistry is likely to be an important step forward and Illumina are unlikely to have released it without a lot of confidence in it. So do we need to be particularly concerned about the lack of G signal or the dual-fluorophore approach used for A?

Help, my G is missing: argumentum ad ignorantiam, "absence of evidence is not evidence of absence", is the point of view most people seem to be coming from. There is concern that a null signal is not the same as a nice bright green spot. But in discussing this with colleagues I was struck by the fact that twenty years ago the lack of signal was the basis for Sanger sequencing working at all. Take a look at the image below; you can read the sequence yourself:
Anyone who remembers, or actually performed, radioactive sequencing will know what I am talking about. All those lanes to interrogate, reading bases out to a mate while guided by a ruler down the autoradiograph. A genome felt like an impossibility. Yet the "missing" signal was not a problem but the key to success: 75% of bases in each track had no signal, so what’s all the fuss over the missing G?
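
To make the comparison concrete, here is a minimal Python sketch of the two ways of reading a base. The two-colour part assumes A lights up both channels and G neither, as described in this post; the assignment of C to red-only and T to green-only, the 0.5 threshold and all function names are illustrative assumptions, not Illumina's actual base caller.

```python
# Sketch only: thresholds and dye-to-channel assignments are assumptions.

def call_two_colour(red, green, threshold=0.5):
    """Call one base from two normalised channel intensities (0..1)."""
    red_on = red >= threshold
    green_on = green >= threshold
    if red_on and green_on:
        return "A"  # dual-labelled: signal in both channels
    if red_on:
        return "C"  # assumed red-only
    if green_on:
        return "T"  # assumed green-only
    return "G"      # dark: no signal in either channel


def read_gel_position(bands):
    """Read one autoradiograph position from four lanes.

    `bands` maps a lane name ("A", "C", "G", "T") to True if a band is
    present. Exactly one lane should carry a band; anything else is "N".
    """
    lit = [lane for lane in "ACGT" if bands.get(lane)]
    return lit[0] if len(lit) == 1 else "N"


def read_gel(positions):
    """Read a whole gel, one position at a time."""
    return "".join(read_gel_position(p) for p in positions)


# Four sequencing cycles for one cluster: (red, green) intensities.
cycles = [(0.9, 0.8), (0.1, 0.05), (0.85, 0.1), (0.05, 0.9)]
two_colour_read = "".join(call_two_colour(r, g) for r, g in cycles)
print(two_colour_read)  # AGCT

# Four gel positions; the last two are unreadable (no band, two bands).
gel = [{"A": True}, {"G": True}, {}, {"C": True, "T": True}]
print(read_gel(gel))  # AGNN
```

In the gel reader three of the four lanes are empty at every position, yet the call is unambiguous because one lane is lit. In the two-colour sketch a G is the only call made with nothing lit at all, so a cluster that had dropped out entirely would also be read as G.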

But what about comparisons to HiSeq SBS: We'll have to wait for academic groups to sequence larger sample numbers than Illumina have so far produced to get a real handle on how well the SBS chemistries compare. For now most data is still going to come from four-colour SBS. But in the future, who knows? Will two-colour SBS trickle down/across to other platforms? I'm certainly liking the simplification of hardware and looking forward to a more robust instrument. After all, most of my group's tweets are to say how long our queue is or that our HiSeq has broken.

I only ever did radioactive sequencing once, during my third year at Uni. It was a two-week-long exercise, what with development of the autoradiograph, and I got a few hundred base pairs for my project. That was in 1995; nineteen years later, in the same two weeks, an XTen system could generate 640 30x human genomes.


  1. There is actually one easy way to locate clusters while getting away from the "no G" signal -- you could just sequence either one of the indexes first. I haven't looked at anything in the sequencing process on the NextSeq, but it seems to be an easy workaround.

    It's not a great analogy to compare it to Sanger sequencing though. The lack of a signal in Sanger is complemented by a signal in another channel -- that is, if you did everything correctly there will be a signal somewhere. In the 2-channel SBS system, a G means there is no signal whatsoever on that cluster. It's especially disconcerting when you are dealing with homopolymers. At least when you have a signal you can make a ballpark estimate of what's happening, but when you have nothing it's really hard. Also, the way the "G" Q-score is calculated will be completely different and to me seems kind of sketchy.

  2. Perhaps, but at least you can see all of the four bases to some extent.

    Pretty much everything about Sanger sequencing is different, so I'm not sure the comparison is valid. Although slow, Sanger is still the gold standard.

    Illumina don't have all the time in the world. Fluorescence-based sequencing is not the end game by a long way. It's just too expensive. They’re in a race. And if you want proof of Illumina's desperation, look no further than the HiSeq X Ten, with the 72 million dollars needed for bioinformatics costs over four years to claim the $1000 genome, which, let's face it, is complete b/s. Irrespective of bioinformatics, they're not even close to achieving the $1000 genome.

    And it’s not just the absence of G-signal and confirmation of incorporation, it’s what else is happening that’s not being reported.

  3. I'm pretty sure the 72 million figure was not the bioinformatics cost - could anyone chime in? I thought that was the all-in cost, and that amortized over the number of genomes expected to be run you get pretty close to $1000?

    1. $72M comes from $1000/genome at 18k genomes per year over 4 years (72k genomes).
      The $1000 cost breaks down as follows:
      $800 for sequencing reagents
      ~$135 for instrument amortization
      ~$65 for library prep, labor and automated analysis (alignment)
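
A quick check of the arithmetic in this breakdown, as a minimal Python sketch. It reads "18k genomes" as a per-year figure, which is an assumption (though the 72k total over 4 years implies it), and it treats the approximate "~" amounts as exact. The variable names are illustrative.

```python
# Figures taken from the comment above; "~" values are treated as exact.
GENOMES_PER_YEAR = 18_000  # assumed to be a per-year throughput
YEARS = 4

cost_per_genome = {
    "sequencing reagents": 800,
    "instrument amortization": 135,
    "library prep, labor and automated analysis": 65,
}

per_genome = sum(cost_per_genome.values())
total_genomes = GENOMES_PER_YEAR * YEARS
total_cost = per_genome * total_genomes

print(f"per genome:    ${per_genome:,}")      # per genome:    $1,000
print(f"total genomes: {total_genomes:,}")    # total genomes: 72,000
print(f"total cost:    ${total_cost:,}")      # total cost:    $72,000,000
```

On these numbers the $72M is the all-in cost of running 72k genomes at $1000 each, and reagents account for 80% of it.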


  4. Correct me if I'm wrong, but previous SBS chemistries had four dyes, A and C in the red channel and G and T in the green. As two dyes exist in each channel, there is some spectral overlap which needs correcting for algorithmically, which would surely contribute towards sequencing error. One advantage of the new system is that, with only two dyes, each can exist in its own channel, minimising spectral overlap, removing the need for matrix calculations and, in theory, actually reducing the error rate. Just saying: the new 2-dye system could improve error rates.

    1. The four original dyes were blue, green, red and yellow. The only issue was possible FRET under certain circumstances and this was covered in the choice of frequencies and design so not an issue.
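
For readers unfamiliar with the "matrix calculations" mentioned in this thread, here is a minimal Python sketch of correcting spectral overlap between two dyes. It assumes a simple linear model in which each detector sees a fixed fraction of the other dye's signal. The crosstalk values and function names are hypothetical, chosen for illustration only, and are not taken from any Illumina instrument.

```python
# Sketch only: a 2x2 linear crosstalk model with made-up coefficients.

def mix(true_signal, crosstalk):
    """Simulate what two detectors observe given the true dye signals."""
    (a, b), (c, d) = crosstalk
    x, y = true_signal
    return (a * x + b * y, c * x + d * y)


def unmix(observed, crosstalk):
    """Recover the true dye signals by inverting the 2x2 crosstalk matrix."""
    (a, b), (c, d) = crosstalk
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("crosstalk matrix is singular; dyes cannot be separated")
    x, y = observed
    return ((d * x - b * y) / det, (-c * x + a * y) / det)


# Hypothetical crosstalk: detector 1 picks up 20% of dye 2,
# detector 2 picks up 30% of dye 1.
CROSSTALK = ((1.0, 0.2), (0.3, 1.0))

observed = mix((10.0, 5.0), CROSSTALK)
recovered = unmix(observed, CROSSTALK)
print(observed)
print(recovered)
```

The correction is exact only when the crosstalk coefficients are known exactly. Any error in estimating them carries through to the recovered signals, which is the sense in which unmixing could contribute to sequencing error.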

  5. In the absence of a C nucleotide, G will preferentially pair with T and vice versa. Since the steric bulk of the label-free G has been decreased relative to the labeled C, this could increase the probability of G:T and T:G mismatches and misreporting.

  6. Actually it's worse than that. As I understand it, A is dual labelled and so sterically much larger than the label-less G. So this could also lead to the preferential misincorporation of G over A.