Thursday, 28 August 2014

SEQC kills microarrays: not quite

I've been working with microarrays since 2000 and ever since RNA-seq came on the scene the writing has been on the wall. RNA-seq has so many advantages over arrays that we've been recommending them as the best way to generate differential gene expression data for a number of years. However the cost, and lack of maturity in analysis meant we still ran over 1000 arrays in 2013, but it looks like 2014 might be the end of the line. RIP: microarrays.

A slew of RNA-seq papers were published in Nature Biotechnology, see the list and abstracts at NextGenSeek and watch for posts on the papers. The RNA-seq blog briefly describes the main findings from the ABRF paper. I thought I'd try and summarise them in some detail here as I am one of the authors on the SEQC paper.

How should I run RNA-seq: Two things are missing for me, and they are some of the first questions I get asked about RNA-seq studies. What depth & length should I use for differential gene expression. Probably 80% of the RNA-seq we run is for people looking at DGE, so I'm a big fan of trying to make this as efficient as possible. Our own analysis shows that 10-20M single0end 50bp reads is fine for most applications. What do SEQC and ABRF say?

Firstly SEQC buries the read length in the paper (its all PE100), secondly they did not do a comparative analysis for DGE using SE vs PE or of different lengths. ABRF used PE50 reads, but do not state figures specific for DGE, and again don't clearly answer the type/length question. SEQC stated that 5 million reads compares well to Affy U133 arrays, 10M reads finds most strongly expressed transcripts and shows good results across sites. But SEQC recommends 5-50M reads for "simple gene-level expression profiling". In my lab today with library prep costing £70 and 1M SE50 reads costing £2.50, a 24 sample RNA-seq experiment would be £2000-5000, that's the same as running one experiment instead of two!

Strandedness: SEQC and ABRF both used older non-stranded RNA-seq methods for Illumina. Interestingly they reported that over 5000 genes were not detected on Illumina but were found on SOLiD using a stranded protocol. I think this goes to show how important strand information might be to users. 

SEQC provide a micro-literature review for RNA-seq, which is helpful in trying to understand what has gone before.I'd certainly point people to this and will spend some time reading these references myself. I like reviews and a systematic one can highlight some important concepts - look for Supplementary Table 1: Literature overview.

Choice of RNA controls: Both papers used the MAQC RNA-samples, in hindsight these may be a poor choice since they are admixtures of individual patient RNA samples. This makes their suitability for demonstrating how wel RNA-seq works for allele-specific expression pretty much useless. Something clearly lacking for me is a discussion about allele-specific analysis. This was perhaps an oversight in the initial designs of both experiments, but feels obvious in its absence from discussion.

SEQC an overview: The FDA led the SEqeuncing Quality Control project which borrowed heavily from the landmark MAQC study. SEQC was an international effort, with authors from the USA, China, UK, Australia, Germany, Spain, Austria, Denmark, Italy, Japan, Belgium, Finland, India, Netherlands, Switzerland, France and Saudi Arabia.

SEQC generated 12 billion mapped HiSeq reads, and over 100 billion reads or 10Tb of RNA-seq data in total! Similar to other comparison studies, SEQC found that no single platform could truly be considered a gold-standard. However good correlations were reported for data generated across multiple sites, and comparisons of RNA-seq to arrays showed that although there is an obvious read-depth dependency for sensitivity, ultimately RNA-seq is more sensitive than arrays. RNA-seq also allows the study of novel isoforms with good agreement between platforms, and even junctions with only 10 reads validated quite well.

SEQC reports high correlations between HiSeq 2000 and SOLiD. It is not unfair to ask the question why bother, given that SOLiD is dead. Unfortunately the time taken for a project of the scale of SEQC to be completed is far greater thn the time to release a new system. In the time since SEQC was conceived we've gone from GAIIx (40Gb), through HiSeq (100Gb), HiSeq v3 (300Gb), HiSeq v4 (500Gb) and X-Ten (1Tb)!

As the SEQC samples were prepped and sequenced at multiple sites it is reasonable to use intra-sample comparisons to try and understand the false positive rate. The result was under 1.5% FDR.

ABRF an overview: The Association of Biomolecular Resource Facilities performed RNA-seq using poly-A, ribo-minus, size-selected and degraded-RNA protocols across 15 labs on five sequencers. They reported good intra- and inter-platform concordance for differential gene expression. ABRF specifically aimed to address the impact of read depth, type and length on RNA-seq analysis; they reported the effect on depth showing that more reads finds more genes, and slso showed how type and length affect splicing analysis. But I did not find a nice simple recommendation of what is best for simple DGE.

ABRF were quicker to complete and publish their study as data were generated on HiSeq 2500 and PGM/Proton vs SEQCs HiSeq 2000 and SOLiD. They also looked at degraded RNA, and show that poly-A–enriched, ribo-depleted RNA libraries, and even severely degraded RNA libraries can be comparable.

Proton, PGM and 454 platforms detected more splice site with fewer bases than Illumina's HiSeq. However, when they compared the effect of long read-length in Illumina by running the same library on HiSeq/MiSeq thay saw a boost in junction identification. Without stating the obvious longer-reads find more splice junctions - less obvious is that they should also be less susceptible to analysis artifact's as two splice junctions occurring in a single read is real evidence vs a bioinformatic assembly.

In summary: Id really like to see a good standard emerge for differential gene expression, that allows users to be confident in their choice of read depth, type and length (personally I stand by 10-20M SE50 reads and 4+ replicates). I think these, and other papers give us the data but not the simple statements. Perhaps a meta-analysis to look purely at DGE.

Perhaps the thing I struggle with most is one of the sentences in the SEQC discussion "Combining deep RNA-seq for alternative transcript discovery with modern high-resolution microarrays for genome-scale quantification may provide an efficient approach for systematic transcript-level expression profiling"; to suggest that we should combine microarrays (and by doing so continue to develop the technology) seems contrary to what sequencing promises to deliver. More and significantly longer reads are coming. I've made the point that once HiSeq generates 300, 600 or possibly even 1000bp reads then for most transcripts we probably have enough information to discover just about everything. For me a future without arrays for RNA analysis seems bright enough.

1 comment:

  1. For what it's worth I totally agree with 10-20M single end 50bp reads with a stranded protocol for DGE. Nice to see someone else state the same. Given a fixed budget its better to have less depth of coverage but allow more biological replicates as biological variability will always be greater than technical. Also greater coverage just means that you detect things with lower expression, and this may not be that useful if you're looking for genes that you can follow up on, for example proof of mechanism biomarkers where you want to ultimately measure protein levels.