Some comments and analysis from the exciting and fast moving world of Genomics. This blog focusses on next-generation sequencing and microarray technologies, although it is likely to go off on tangents from time-to-time
We'll soon be back in sunny Florida (current forecast is low to mid 20s) for another cram-packed Advances in Genome Biology & Technology meeting. Of course the focus is still on sequencing, but with instruments like the CyTOF coming along and huge improvements in total proteome analysis, how long the genome will stay top of the heap is not clear.
Again the agenda is very full, and full of interesting stuff too! The standout presentation title has to be Saturday's talk at 11:40–12:00 by Tatiana Moroz (University of Florida): Space Genomics: Epigenomic Mechanisms for Adaptations to Microgravity. Other highlights for me are:
Thursday, February 26th:
7:30 p.m. – 7:50 p.m. Hie Lim Kim, Nanyang Technological University: Khoisan Hunter-Gatherers Have Been the Largest Population Throughout Most of Modern Human Demographic History
7:50 p.m. – 8:10 p.m. Karyn Meltz Steinberg, The Genome Institute at Washington University: FinMetSeq: Exome Sequencing of 20,000 Finns Identifies Known and Novel Associations With Cardiometabolic Traits
8:50 p.m. – 9:10 p.m. Richard Leggett, The Genome Analysis Centre: Towards Real-Time Surveillance Approaches Using Nanopore Sequencers
9:10 p.m. – 9:30 p.m. Nicholas Navin, MD Anderson Cancer Center: Single Cell Sequencing Identifies Clonal Stasis and Punctuated Copy Number Evolution in Triple-Negative Breast Cancer Patients
Friday, February 27th:
11:00 a.m. – 11:20 a.m.* Stephen Kingsmore, Children’s Mercy Kansas City: Newborn Sequencing in Genomic Medicine and Public Health: Rapid Genome Sequencing for Genetic Disease Diagnosis in Neonatal Intensive Care Units
8:30 p.m. – 8:50 p.m. Lia Chappell, Wellcome Trust Sanger Institute: Revealing Malaria Parasite Transcriptomes Using Directional, Amplification-Free RNA-seq
8:50 p.m. – 9:10 p.m. Roman Yelensky, Foundation Medicine: A Novel Next-Generation Sequencing (NGS)-Based Companion Diagnostic Predicts Response to the PARP Inhibitor Rucaparib in Ovarian Cancer
The focus is very much on the options and challenges users face when making decisions about how to make NGS libraries. The article has sections on fragmentation; DNA library prep; RNA-seq library prep; complexity, bias and batch effects; target capture/amplification; mate-pair library prep; ChIP-seq library prep; RIP-seq/CLIP-seq and finally Methylation sequencing.
DNA fragmentation: This section covers all the major options, physical or enzymatic, and discusses the importance of making sure the insert size is correct for your application. An interesting comment was that the lab has successfully clustered and sequenced libraries with 1500bp inserts!
Complexity, bias, and batch effects: The section discusses the importance of understanding the biases in the methods being used and of good experimental design. The discussion of using duplicate reads to measure library complexity covers the important points of read-depth and sampling error. The authors present the basic methods as applied to genomes, where nucleic acids are present in roughly equimolar ratios, and discuss the caveats of applying the same methods to RNA-seq or ChIP-seq, where they most certainly are not. There is only a brief mention of molecular indexing and the potential impact this has on NGS analysis. The first take home message is to minimise batch effects and PCR, but there is no discussion about the quantification of your final libraries being a good place to make decisions on reducing PCR cycles. The second is that users should “keep in mind the general principle that more starting material means less amplification and thus [usually] better library complexity”. Just because a kit can work with 50ng of RNA does not mean you should use 50ng when you can easily get 500ng!
Target capture/amplification: This section covers the major methods for in-solution target capture, but it is a bit light on amplicon sequencing methods. There are many companies out there selling amplicon kits and I’d suggest people look at the Fluidigm Access Array and Wafergen SmartChip for amplicons on Illumina, or even combine Ampliseq with Nextera XT.
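To make the duplicate-read idea concrete, here is a minimal Python sketch of a Lander-Waterman style estimate. If N reads are drawn at random from a library of X distinct molecules, you expect to see roughly X(1 - e^(-N/X)) distinct ones, so X can be solved for from the observed counts. This is an illustration rather than code from the review: the function names are made up, and it assumes every molecule is equally likely to be sequenced, which is exactly the assumption that breaks down for RNA-seq and ChIP-seq.

```python
import math


def expected_unique(library_size: float, reads: float) -> float:
    """Expected distinct molecules seen after `reads` random draws
    (with replacement) from `library_size` equally likely molecules."""
    # -expm1(-x) is 1 - exp(-x), but stays accurate when x is tiny.
    return library_size * -math.expm1(-reads / library_size)


def estimate_library_size(total_reads: int, unique_reads: int) -> float:
    """Solve unique = X * (1 - exp(-total / X)) for X by bisection."""
    if not 0 < unique_reads < total_reads:
        raise ValueError("need 0 < unique_reads < total_reads (at least one duplicate)")
    lo = hi = float(unique_reads)  # X can never be smaller than what we saw
    # Grow the upper bound until it explains the observed unique count.
    while expected_unique(hi, total_reads) < unique_reads:
        hi *= 2.0
    # Plain bisection; 100 halvings is far more than double precision needs.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_unique(mid, total_reads) < unique_reads:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical run: 500,000 reads, of which 393,469 are non-duplicates.
    print(round(estimate_library_size(500_000, 393_469)))
```

The fewer duplicates you see at a given depth, the larger the estimated library, which is why a low duplication rate at shallow depth tells you very little on its own.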
Single-cells: The review only briefly mentions single cells and the Fluidigm C1 system. I suspect we will not have long to wait before we start to see similar papers focused on single-cell NGS.
“When in doubt, consulting a statistician during the experimental design process can save an enormous amount of wasted money and time.”
Cellular Research released their latest development: Resolve for single cell mRNA-seq sample-prep at under £1 per cell. A paper in today's Science describes the method: Combinatorial labeling of single cells for gene expression cytometry. Using CytoSeq (why not Cyto-seq) 10,000 or 100,000 cells can be analysed. Like the C1, cells need to be flow sorted, but unlike the C1, CytoSeq does not apply as stringent a restriction on cell size or morphology: if you can sort it, CytoSeq can sequence it. The paper presents data from several hematopoietic systems, but solid tissues, e.g. tumours, should be analysable if they can be mechanically or enzymatically disaggregated.
Who Are Cellular Research: The company was set up by Steve Fodor (hence the Affy link in the title of this post) in 2011 at the same time a PNAS paper first described the molecular indexing approach. I was first alerted to Cellular Research by a contact who'd moved from Fluidigm in 2013. The publication in 2014 of a PNAS paper by Glenn Fu on molecular indexing in RNA-seq showed what Cellular Research might be delivering, and in the past few months we've begun testing of the Precise assay for targeted RNA-seq. The workflow in the lab is great and expected costs are just £10 per sample for up to 130 genes.
Precise assay workflow
How does Cyto-seq work: Cells in suspension are loaded into 20 picolitre wells, such that most wells are empty but those that do contain cells only have one. Oligonucleotide coated beads deliver the molecular index for each cell, and the molecular indexes for the mRNAs at the same time. mRNAs bind to the oligos ready for 1st strand cDNA synthesis and, similarly to the Precise protocol, all 10,000 cells are pooled for downstream processing as a single reaction through reverse transcription, cDNA amplification and finally sequencing. Figure 1 from the Science paper describes the workflow. The first experiment described in the Science paper was a mixture analysis of K562 (myelogenous leukemia) and Ramos (Burkitt’s lymphoma) cells using a panel of 12 genes: five genes specific for K562 cells, six genes specific for Ramos cells, and the common housekeeping gene GAPDH. Other experiments were reported against panels of 93, 98 and 111 genes. Not quite whole transcriptome, but only 1-5M reads per experiment makes CytoSeq the first single-cell transcriptome MiSeq application.
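The "mostly empty wells, one cell at most" loading can be pictured with a simple Poisson model. The sketch below is an assumption on my part, not something taken from the paper: it treats the number of cells landing in a well as Poisson distributed, and the loading density of 0.1 cells per well is a made-up value for illustration.

```python
import math


def well_occupancy(mean_cells_per_well: float) -> dict:
    """Fractions of empty, single-cell and multi-cell wells,
    assuming cells fall into wells independently (Poisson loading)."""
    empty = math.exp(-mean_cells_per_well)
    single = mean_cells_per_well * empty
    multiple = 1.0 - empty - single
    return {"empty": empty, "single": single, "multiple": multiple}


def singlet_fraction(mean_cells_per_well: float) -> float:
    """Of the wells that contain any cells, the fraction holding exactly one."""
    occ = well_occupancy(mean_cells_per_well)
    return occ["single"] / (1.0 - occ["empty"])


if __name__ == "__main__":
    # Hypothetical loading density: one cell for every ten wells.
    density = 0.1
    print(well_occupancy(density))
    print(f"occupied wells with a single cell: {singlet_fraction(density):.1%}")
```

Under this model the trade-off is plain: the sparser you load, the fewer doublets you get, but the more wells you need for the same number of cells.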
The method again uses the power of molecular indexing to tag multiple cDNAs from single cells and apply unique indexes to both mRNAs and the cells they come from. You'll be able to run CytoSeq in your lab from 2016 when Cellular Research will release an instrument to perform the library-prep workflow in cartridges of 5-10,000 cells per run.
Figure 1 from Fu et al 2015.
How much might Cyto-seq cost: The combination of a Resolve cartridge for 10,000 cells at £1 each plus a single MiSeq run at £600 comes to a little over £10,000 for a 100 gene panel. It is not clear how scalable the number of genes is and whole transcriptome may be a ways off yet. But assuming you can do this, and you stick with the 1-3M reads per cell that the major single-cell labs (and Fluidigm) are suggesting then each cell would cost about £3-5 to sequence on HiSeq 2500 today. So a 10,000 cell CytoSeq total mRNA-seq experiment would cost £6000 for library prep and £30,000+ for sequencing (price per M reads here). Not cheap, but the impact on some biological questions will be impressive, and new questions can be asked if we can do this kind of work routinely.
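For anyone who wants to play with these numbers, here is a back-of-envelope Python sketch using the figures quoted above (10,000 cells, £1 per cell for the cartridge, £600 for a MiSeq run, £3-5 per cell for whole-transcriptome depth on a HiSeq 2500). The function names are invented for this example and the prices are rough estimates, so treat the output as a guide rather than a quote.

```python
def targeted_panel_cost(cells: int, prep_per_cell: float, miseq_run: float) -> float:
    """Library prep for every cell plus one MiSeq run, in pounds."""
    return cells * prep_per_cell + miseq_run


def sequencing_cost_range(cells: int, low_per_cell: float, high_per_cell: float) -> tuple:
    """Low and high estimates for sequencing every cell, in pounds."""
    return cells * low_per_cell, cells * high_per_cell


if __name__ == "__main__":
    cells = 10_000
    # Targeted panel of around 100 genes on a single MiSeq run.
    print(targeted_panel_cost(cells, prep_per_cell=1.0, miseq_run=600.0))  # 10600.0
    # Whole transcriptome at 1-3M reads per cell, sequencing only.
    print(sequencing_cost_range(cells, 3.0, 5.0))  # (30000.0, 50000.0)
```

Changing `cells` or the per-cell figures makes it obvious that sequencing, not library prep, dominates the bill once you move from a panel to the whole transcriptome.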
Clinical applications of CytoSeq: Will the method be applicable to blood cancers as a new screening tool? Will gene expression analysis of disaggregated solid tumours be possible in real time and at a cost that can make an impact on patient care? I am sure people are already working on these kinds of questions.
Why is molecular indexing important: I think molecular indexing is a big leap forward for NGS. Being able to clearly identify single molecules on an Illumina sequencer means the need to develop single molecule sequencers is significantly lessened for most of us. Molecular indexes should allow us to reduce the impact of technical artifacts in PCR amplification and resolve copy-number amplifications and deletions, mRNA DGE, chromatin binding peaks, and exome allele calls much better than we can today.
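The core trick is easy to show in code. In this minimal Python sketch each read carries a cell label, a gene and a molecular index; reads that share all three are treated as PCR copies of the same original molecule and counted once. The read tuples and the function name are hypothetical, and a real pipeline would also need to cope with sequencing errors in the index, which this sketch ignores.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct molecular indexes per (cell, gene).

    `reads` is an iterable of (cell_label, gene, molecular_index) tuples.
    PCR duplicates share all three fields and collapse to one molecule.
    """
    counts = defaultdict(int)
    for cell, gene, _index in set(reads):  # set() removes exact duplicates
        counts[(cell, gene)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up reads: the first two are PCR copies of one molecule.
    reads = [
        ("cellA", "GAPDH", "AACG"),
        ("cellA", "GAPDH", "AACG"),
        ("cellA", "GAPDH", "TTGC"),
        ("cellB", "GAPDH", "AACG"),
    ]
    print(count_molecules(reads))
```

Four reads go in, but only three molecules come out: two GAPDH molecules in one cell and one in the other, regardless of how unevenly PCR amplified them.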
Updated: The cost per M reads and cost per Gb figures in the original posting were wrong - damned Excel operator error! I've fixed them, again. Thanks to Shawn for his comments.
I've been asked about the differences between the sequencers in the Illumina line-up so many times that I put together a spreadsheet to help the discussions. This is cobbled together from the Illumina website and there are no prices quoted, however I have estimated the £ per M reads and the £ per GB.
The amount of RNA used in an RNA-seq library prep is often listed as a competitive advantage by kit manufacturers. As late as 2004 I used up to 30ug of RNA in a microarray prep, and even a couple of years ago 100ng was considered "low". Nowadays kits are available for picogram quantities, but have you ever considered how much of the total RNA you measure is actually going to be informative?
The answer is not a lot! Stopping to think about this is important as the amount of something in a sample directly corresponds to how easily we can measure it. Wendell Jones (Global Head of Bioinformatics at Expression Analysis) gave a great talk at the recent RNA-seq Europe meeting, where he discussed the relative abundance of different RNAs and the ease (or not) of measuring these on different gene expression platforms. He kindly gave me a copy of his slide deck and I've used this as the basis of my figures below.
RNA QT/QC: We often measure RNA quantity with Ribogreen and quality using the Bioanalyser. When you take a look at a typical Bioanalyser trace you'll see two major peaks from the 18S and 28S ribosomal RNAs, the ratio of which is used to calculate the RIN. It should be obvious to all that what we see on the Bioanalyser are two stonking great peaks from just two rRNAs, and these, usually unimportant from our perspective, account for a large portion of the total RNA in our Eppendorfs.
What is total RNA: The RNA we get after a total RNA extraction is a complex mix of millions of transcripts. However it is also a mix that is dominated by a very few species: tRNA, rRNA and some very highly expressed transcripts (e.g. Globin or Rubisco). The RNAs we are usually interested in are expressed at very low levels compared to these, and at first glance at the figure below you might just wonder how we measure any of them at all! This is because the most abundant RNAs, namely tRNA and rRNA are uninteresting to most scientists and we usually enrich for mRNA/ncRNA which are in the bottom 5% (by abundance).
If you look at a typical Bland-Altman plot (below) you'll see how the spread of gene expression data (e.g. two replicates) increases at lower RNA expression levels due to measurement noise. This was simplified in Wendell's presentation so we can more easily see there is a point at which we move from quantitation of transcripts to detection. The line is of course artificial and where you consider this should be drawn will depend on many things.
RNA-seq Bland-Altman plot
We can take advantage of this and get flexibility in the dynamic range of our experiments by sequencing to different depths (usual), and/or increasing replicates (preferable).
How does RNA-seq compare to other DGE methods: Wendell presented a great slide where he compared where the detection/quantitation boundary lies for multiple differential gene expression technologies. Most of us are rarely going to go past 20 or 50M reads for differential gene expression, so qPCR is still looking like a tool we'll be using for many years to come. How it competes against some of the newer targeted RNA-seq assays will be interesting to see, and the impact molecular barcodes will have on true transcript abundance measurements is going to give newer RNA-seq methods an edge over current ones.
How low can you go: It is important to remember that while there are methods that can work with incredibly low amounts of RNA, including single-cell RNA-seq, the lower your inputs go, the less chance you have of sequencing the RNAs you might be interested in, especially if they are low-abundance transcripts. Sampling error is something you really need to understand before dropping inputs very low. In a microarray experiment we clearly showed that reduced RNA input had a clear impact on detection sensitivity, but there was no impact on specificity. Even at low inputs, when we saw differential gene expression the results were accurate (see Lynch et al: The cost of reducing starting RNA quantity for Illumina BeadArrays). The same is (hopefully) going to be true for RNA-seq. For single-cells it will be interesting to see what the community decides is the right read-type and read-depth to use. I'd be surprised if we go above 10M reads, and we might prefer to use 384 cells with just 1M reads each.
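Sampling error is easier to appreciate with numbers. The Python sketch below is a simplification of my own: it treats the read count for a transcript as Poisson distributed with mean (read depth × fraction of the library made up by that transcript) and asks how likely you are to see at least a given number of reads. The one-in-a-million transcript fraction is a made-up value; the 1M and 10M depths are the ones discussed above.

```python
import math


def detection_probability(depth: int, fraction: float, min_reads: int = 1) -> float:
    """Probability of seeing at least `min_reads` reads for a transcript
    that makes up `fraction` of the library, at `depth` total reads.
    Assumes reads are sampled independently (Poisson approximation)."""
    mean = depth * fraction
    # Sum the probabilities of seeing 0 .. min_reads-1 reads, then invert.
    p_too_few = sum(
        math.exp(-mean) * mean**k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    rare = 1e-6  # hypothetical transcript: one molecule in a million
    for depth in (1_000_000, 10_000_000):
        p = detection_probability(depth, rare, min_reads=5)
        print(f"{depth:>10,} reads: P(at least 5 reads) = {p:.3f}")
```

The same reasoning applies one step earlier, to the molecules that make it into the tube in the first place: if a transcript is not sampled into the library, no amount of sequencing will bring it back.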
Acknowledgements: Thanks very much to Wendell for sharing these slides.
"We live in an interesting age" or so the quote goes. It is very interesting that BGI are going to start selling sequencing technology. The coverage of Jun Wang's (BGI CEO) JP Morgan presentation by GenomeWeb lays out what they plan to do to start actively competing with Illumina: exome sequencing, NIPT on BGISEQ-1000 and now new sequencers that will compete directly with HiSeq and MiSeq.
I am not sure, but would hazard a guess that BGI is still one of Illumina's largest customers for instruments and reagents. It must be an interesting relationship to manage from both sides!