Friday, 17 January 2014

NextSeq 500's new chemistry described

NextSeq 500 uses a two-colour chemistry rather than the original four colours. This makes a massive difference to the complexity of producing reagents, the instrumentation and the computation; all are effectively reduced by a factor of two. So how does it work?

Update 290114: I got confirmation at an Illumina event yesterday that the A base carries a single fluorophore and that a mix of A bases labelled green or red is used to generate the A signal in both channels. Clusters on NextSeq are huge, and the flowcell is massive compared to a MiSeq or HiSeq because it needs a lot of mm² to keep the number of clusters high. Clusters also contain around 5000 molecules, compared to the usually quoted figure of 1000 on HiSeq.

I've not seen a detailed description of the chemistry yet, but thought I'd start with an image from Illumina showing how the four colours work and the camera system in the older sequencers. The image of the cameras comes from the Bentley Nature paper of 2008: Accurate whole human genome sequencing using reversible terminator chemistry. Each of the four bases is labelled with a separate dye; these are imaged using two lasers and a filter wheel to discriminate between the two colours per laser. Overlap in the spectra means there is not perfect discrimination between each base. Four pictures are required per cycle.

The new chemistry: Below is an image I put together based on my understanding of how the NextSeq 500 chemistry works. If this is wrong I accept no blame, but would welcome comments to improve the figure. The four bases are no longer labelled with four colours: in the new chemistry only two dyes are used, red and green. Two bases are labelled with single dyes, a third with both dyes and the fourth with no dye at all.

The figure below shows a single tile over five SBS cycles, with each cluster showing its respective base. One base is highlighted, and the basecalls and dye colour are shown underneath the respective tiles. Tile 2 additionally shows the two images that would be used in basecalling. Grey indicates a null cluster in these examples.

Illumina's newest SBS chemistry: clever stuff huh!

Only two pictures are taken, as opposed to the four in the previous incarnation of SBS. In each picture each cluster appears either in a single channel (T or C), in both channels (A), or in no channel at all (G). Two pictures instead of four makes computation much easier, and a new version of RTA also performs even better on low complexity libraries in both the sequence and index reads.
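
The two-image scheme amounts to a lookup from "which channels show signal" to a base. The sketch below is purely illustrative and is not Illumina's RTA algorithm: the names `call_base` and `call_read`, the threshold value, and the assignment of T to green and C to red (as one commenter below assumes) are all my own assumptions.

```python
# Illustrative two-channel base caller; NOT Illumina's RTA algorithm.
# Assumption: T appears only in green, C only in red.
THRESHOLD = 0.5  # hypothetical cutoff on normalised intensity (0..1)


def call_base(red: float, green: float, threshold: float = THRESHOLD) -> str:
    """Map one cluster's red/green intensities to a base call."""
    in_red = red >= threshold
    in_green = green >= threshold
    if in_red and in_green:
        return "A"  # signal in both channels
    if in_red:
        return "C"  # red only (assumed assignment)
    if in_green:
        return "T"  # green only (assumed assignment)
    return "G"      # dark in both channels


def call_read(cycles):
    """Call one base per cycle from (red, green) intensity pairs."""
    return "".join(call_base(red, green) for red, green in cycles)


if __name__ == "__main__":
    # Five SBS cycles for one cluster, as made-up (red, green) pairs.
    example = [(0.9, 0.8), (0.1, 0.9), (0.0, 0.1), (0.8, 0.1), (0.9, 0.9)]
    print(call_read(example))
```

Note that G is called from the absence of signal, so in any single cycle a dark G cluster and a missing cluster look the same to this kind of caller.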

This simplification could significantly reduce the cost of producing Illumina sequencing reagents and instruments, meaning cheaper sequencing for you and me and even better profits for Illumina. That share price should hit $200 by the end of the year.


  1. If this is the case what would happen if you started with a long run of Gs? Has cluster definition changed or would you miss these?

  2. Good point. For a start showing the clusters as grey is misleading as they should be dark, that is not visible above background if there's no fluorescent label.

    So Illumina must be 100% sure the label-less G will 100% incorporate and the 3' protecting group will be fully removed in every cycle, even in difficult GC-rich regions. This can't be possible.

    Suffice to say, it's fundamental scientific methodology that the absence of a result is not the same as a positive result and can't be assumed to be one. I also wonder about the dual tags and instances in which they may not both be accessible or visible to the laser and cameras.

  3. They fully control the clustering adapters and sequencing primers. It should not be difficult to ensure absence of poly Gs at the top of the cluster.

  4. although the cluster does allow for error to occur. I guess that's the benefit of PCR, the downside being the inherent bias and so never as good as single molecule in addition to the epigenetic information that's lost. But really the simplification doesn't improve the chemistry as it's a compromise for data purposes.

  5. Do 100% of the unlabelled G really have to incorporate in a cluster? Isn't this just a matter of dephasing? As I understood, unlabelled nucleotides incorporate better than labelled ones. So dephasing should be even less in the G cycle, giving better data at the end of the read because the G cycle is actually more efficient? Correct me if I am wrong.

  6. Maybe, but were there any issues with the labelled G before? There are not just steric considerations but also electrostatic interactions. By the same token the dual-labelled nucleotide would have much increased steric bulk, so might lead to decreased efficiency.

    The phasing could be better, but the cluster architecture has changed and the temperature has also been increased, which could lead to misincorporations like GT. But without a label how would you know if any of this is happening? These are big changes to the chemistry.

  7. I agree there have been a large number of changes to get to NextSeq 500 chemistry. Each time Illumina have given us new chemistry it has been better than before and as a very conservative company I don't think they'd have rushed into this. Also given that this sequencer is likely to become the default clinical machine I'd say that getting this change right has been a priority within Illumina.

    I'm not sure if the Adenine is dual-labelled, rather I think it is single-labelled with each dye and then mixed 50:50.

    I used grey to make it easier to see there should be a cluster present but as there is no signal you will not see a colour. It would be more accurate to show nothing at all.

  8. Illumina has somewhat modified the cluster technology, though not the basic science, and developed new and innovative sample preparation chemistries, but to my knowledge this is the first time they have altered the nucleotides, which have remained exactly the same since their development by Solexa nearly ten years ago. In fact the concept remains fundamentally intact, as it’s the 3’ reversible block and cleavable linker that are inventive and not the dye combination.

    “I'm not sure if the Adenine is dual-labelled, rather I think it is single-labelled with each dye and then mixed 50:50.” Then how could this 50:50 mixture of A’s with one dye and A’s with the other dye and T’s and C’s with the same two dyes be differentiated? ie A’s will just look like T’s and C’s. A’s have to be dual labelled in this scenario and so its steric bulk will be significantly increased. But then there’s the issue of determining what base is incorporated, not even to mention the issues that might arise from label-less G. For example, if a T is close to a C will it look like an A? I guess with the increased cluster density this might be a possibility.

    Changes to the chemistry that appear to simplify the process from a data and hardware perspective will actually complicate the chemistry at a molecular level, and so the effect is a lowering of the bar on quality and reliability. The question remains whether this reduction in accuracy will make enough difference to result in outright miscalls. Illumina are gambling that the cluster signals are good enough to absorb this looser chemistry and statistically provide the correct reads, i.e. there’ll be more errors but hopefully not too many.

    I hope you’re right about Illumina’s continuing a conservative approach but there are troubling signs. For a start the concept of the $1,000 genome was always predicated on its application in facilitating the Personalized Medicine revolution so it’s quite disingenuous to claim that raw data meet this objective. It’s not even close but let’s not get into the bioinformatics costs and annotation, variant calling and comparison.

    Why deny access to the majority of their customer base by only selling the new machine in 10’s for $10,000,000? They’re identical and work independently of each other for goodness sake. Sure, it would make sense if the ‘instrument’ was comprised of ten unique units. (Perhaps I can buy the set of ten and on-sell individually for a small margin like a distributor?)

    However, it does make sense from a commercial perspective if the goal is a land grab shutting out competition by changing the dynamics of the industry to primarily service-based for the clinic and in doing so nailing down the regulatory standards with their technology as quickly as possible. They’re assuming the rest of the community will just have to make do with the NextSeq500 which I find quite strange given the HiSeq X 10 is just 10 identical instruments. It’s like they’re doing the very thing many people fear genomics might do, that is creating a hierarchy with first and second class researchers purely for economics purposes.

  9. ““I'm not sure if the Adenine is dual-labelled, rather I think it is single-labelled with each dye and then mixed 50:50.” Then how could this 50:50 mixture of A’s with one dye and A’s with the other dye and T’s and C’s with the same two dyes be differentiated? ie A’s will just look like T’s and C’s.”
    Assuming there is no dye bias on nucleotide incorporation efficiency, each ‘A cluster’ will incorporate 50% A, dye 1 and 50% A, dye 2 making them red and green clusters.

  10. The individual bases are probably/possibly single-labelled, but each cluster will incorporate about 1000 individual bases at each cycle, generating a 50:50 signal, i.e. each cluster will show both red and green at any A cycle. However this would reduce intensity in both channels, so maybe dual-labelling is used? Perhaps someone from Illumina can comment?

  11. ....but that's assuming they're only adding the two supposed A's by themselves kind of like a 454 protocol, ie not with the other three nucleotides. That would be a huge change in protocol adding another full cycle, including the need for image collection and so defeating the purpose, to the original process where all four were added simultaneously and detected simply by the four unique dyes.

    Otherwise a possibility, although only theoretically, could be the A's incorporate much quicker than T's and C's but then an intermediate image would again be required at the end of the A incorporation events before T's and C's began to incorporate. The events would have to be distinct with no overlap in time.

    After all, incorporation of T’s and C’s would also be shown as red and green clusters.

  12. I'm not sure I've explained what I'm thinking: with two differently labeled A bases 500 Green and 500 Red would be incorporated into the cluster with an A base. This should work in the same way as the normal mix of nucleotides. I'll certainly be asking Illumina at AGBT.

  13. In a single cycle, if you only add 50% of each A and no other nucleotides then each A cluster will be 50% red and 50% green. Note this is different from 50% of the clusters being red and 50% green. In the first situation the other clusters will be unincorporated (unless fidelity issues are present), so the other fluorescent bases could be added. Then each cluster would be either: T 100% green, C 100% red, G 100% unlabelled or A 50% red 50% green. James made a good point about intensity, which might make phasing correction, if required, harder in later cycles. The system should work with all bases in a single addition.

  14. Our FAS just told us the A is dual-labelled. This makes the A bases quite a bit "bigger" and this may affect error rates in a way that is specific to NextSeq chemistry, as proposed by earlier comments. I guess we'll have to wait and see.

  15. It’s been quite a while since I have looked at the Illumina technology, so I may have made an error here. My thought was that with Manteia bridge amplification, after the initial ligation both sense and antisense strands are created on the surface and remain present, this being the basis of the paired-end protocol. In this scenario the clusters have the ability to show both ends of each strand through complementarity, and hence a mixture of two colours at each cycle. But I believe now that one strand is actually cleaved, which seems like a real waste of valuable information.

    So if the clusters are distinctive and not overlapping then each cluster would be red, green, red-green or dark at each cycle. So the issues with increased steric bulk of T and labelless-G still remain unclear as to the effect on accuracy and efficiency. If the clusters are overlapping then there are obviously issues.
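
The disagreement in the thread above about mixed versus dual-labelled A can be made concrete with a toy model of per-cluster signal. This is a sketch under stated assumptions, not a description of the real chemistry: the names `cluster_signal` and `overlap` are made up, each dye is given one unit of brightness, the 1000 molecules per cluster is the figure quoted in the comments, and T as green and C as red follows one commenter's assumption.

```python
# Toy model of per-cluster channel signal; all numbers are illustrative.
MOLECULES = 1000  # strands per cluster, the figure used in the comments

# Dyes carried per molecule as (red, green).
# Assumption: T is green, C is red, G is unlabelled.
DYES = {"T": (0, 1), "C": (1, 0), "G": (0, 0)}


def cluster_signal(base: str, a_model: str = "mixed", molecules: int = MOLECULES):
    """Return the (red, green) signal for a cluster incorporating `base`.

    a_model="mixed": half the A molecules carry red, half carry green.
    a_model="dual":  every A molecule carries both dyes.
    """
    if base == "A":
        if a_model == "mixed":
            return (molecules // 2, molecules - molecules // 2)
        if a_model == "dual":
            return (molecules, molecules)
        raise ValueError(f"unknown a_model: {a_model!r}")
    red, green = DYES[base]
    return (red * molecules, green * molecules)


def overlap(signal_a, signal_b):
    """Sum the signals of two clusters that cannot be resolved."""
    return (signal_a[0] + signal_b[0], signal_a[1] + signal_b[1])


if __name__ == "__main__":
    for base in "ACGT":
        print(base, cluster_signal(base, "mixed"), cluster_signal(base, "dual"))
    # An unresolved T cluster plus C cluster lights up both channels.
    print("T+C overlap:", overlap(cluster_signal("T"), cluster_signal("C")))
```

In this toy model a mixed-label A cluster shows both colours at half the intensity of a T or C cluster, which is the intensity concern raised in the comments. An unresolved T and C pair gives the same signal as a single dual-labelled A cluster, which is the overlap worry.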