CoreGenomics

Monday 23 January 2017

CoreGenomics has moved

Follow this link to Enseqlopedia/coregenomics...

"CoreGenomics is dead...long live CoreGenomics"...the CoreGenomics blog has moved to its new home: http://enseqlopedia.com/coregenomics. Please update your bookmarks and register to follow the new blog, for updates on the NGS map (coming soon), and to access the new Enseqlopedia (coming soon)!

Enseqlopedia: Last year I started the process of building the new Enseqlopedia site, after five years of blogging here on Blogger. Whilst Enseqlopedia is still being developed the CoreGenomics blog has moved over and you can also find all the old content there too. Commenting should be much easier for me to manage so please do give me your feedback directly on the site.

NGS mapped: Currently I'm working on the newest implementation of the Googlemap sequencer map Nick Loman and I put together many years ago. The screenshot of the demo gives you an idea of what's changed. The big difference is a search bar that allows you to select technology providers and/or instruments. The graphics also now give a pie-chart breakdown of the instruments in that location...you can clearly see the dominance of Illumina!

Other technologies that will appear soon are single-cell systems from the likes of 10X Genomics, Fluidigm, Wafergen, BioRad/Illumina, Dolomite, etc., so that users can find people nearby to discuss their experiences with (we're also restarting our beer & pizza nights as a single cell club here in Cambridge so keep an eye out for that on Twitter).

Lastly a change that should also happen in 2017 is the addition of users to the map. I'm hoping to give anyone who uses NGS technologies a way to list their lab, and highlight the techniques they are using. Again the aim is to make it easier for us to find each other and get talking.

Enseqlopedia.com is a big step for me. I hope you think it was worthwhile in a year or so. There's one feature I've not mentioned until now which I'm hoping you'll get to hear more about in the very near future - the Enseqlopedia itself. Watch out for it to appear in press.

Thanks so much for following this blog. I'm sad to leave Blogger. I hope you'll come with me to Enseqlopedia/coregenomics.

Friday 9 December 2016

10X Genomics updates

We had a seminar from 10X Genomics today to present some of the most recent updates on their systems and chemistry. The new chemistry for single-cell gene expression and the release of a specific single-cell controller show how much effort 10X have placed on single-cell analysis as a driver for the company. Phasing is looking very much the poor cousin right now, but still represents an important method to understand genome organisation, regulation and epigenetics.

Single cell 3'mRNA-seq V2: the most important update from my perspective was that 10X libraries can now be run on HiSeq 4000, rather than just 2500 and NextSeq. This means we can run these alongside our standard sequencing (albeit with a slightly weird run-type).

The new chemistry offers improved sensitivity to detect more genes per cell, improved sensitivity to detect more transcripts per cell, an updated Cell Ranger 1.2 analysis pipeline, and compatibility with all Illumina sequencers - sequencing is still paired-end but read 1 = 26bp for 10X barcode and UMI, Index 1 is the sample barcode, read 2 = the cDNA reading back to the polyA tail.

It is really important in all the single-cell systems to carefully prepare and count cells before starting. You MUST have a single-cell suspension and load 100-2000 cells per microlitre in a volume of 33.8ul. This means counting cells is going to be very important as the concentration loaded affects the number of cells ultimately sequenced, and also the doublet rate. Counting cells can be highly variable; 10X recommend using a haemocytometer or a Life Tech Countess. Adherent cells need to be trypsinised and filtered using a Flowmi cell strainer or similar. Dead cells, and/or lysed cells, can confuse analysis by leaching RNA into the cell suspension - it may be possible to detect this by monitoring the level of background transcription across cell barcodes. The interpretation of QC plots provided by 10X is likely to be very important but there are not many examples of these plots out there yet so users need to talk to each other.
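To get a feel for the loading numbers above, here is a minimal sketch that simply multiplies concentration by volume. It takes the 100-2000 cells per microlitre and 33.8ul figures at face value; the function name is my own, and capture efficiency (how many loaded cells actually end up sequenced) is not modelled.

```python
# Sketch only: total cells in the loaded volume, from the figures quoted above.
LOAD_VOLUME_UL = 33.8  # loading volume quoted in the seminar


def cells_loaded(cells_per_ul: float, volume_ul: float = LOAD_VOLUME_UL) -> float:
    """Return the number of cells contained in the loaded volume."""
    if cells_per_ul < 0 or volume_ul < 0:
        raise ValueError("concentration and volume must be non-negative")
    return cells_per_ul * volume_ul


# Illustrative concentrations spanning the quoted range
for conc in (100, 1000, 2000):
    print(f"{conc} cells/ul -> {cells_loaded(conc):.0f} cells in {LOAD_VOLUME_UL} ul")
```

A 10% counting error moves the loaded cell number by the same 10%, which is why the counting method matters so much.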

There is a reported doublet rate per 1000 cells of 0.8%, which keeps 10X at the low end of doublet rates on single-cell systems. However it is still not clear exactly what the impact is of this on the different types of experiment we're being asked to help with. I suspect we'll see more publications on the impact of doublet rate, and analysis tools to detect and fix these problems.
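As a rough illustration of what a doublet rate of 0.8% per 1000 cells could mean in practice, the sketch below assumes the rate scales linearly with the number of cells. That linear scaling is my assumption for the purpose of the example, not something stated in the seminar, and the function names are hypothetical.

```python
# Sketch only: assumes the doublet rate grows linearly with cell number.
DOUBLET_RATE_PER_1000_CELLS = 0.008  # 0.8% per 1000 cells, as reported


def expected_doublet_rate(n_cells: int) -> float:
    """Fraction of captured cells expected to be doublets (linear assumption)."""
    return DOUBLET_RATE_PER_1000_CELLS * n_cells / 1000


def expected_doublets(n_cells: int) -> float:
    """Expected number of doublets among n_cells captured cells."""
    return n_cells * expected_doublet_rate(n_cells)


for n in (1000, 5000, 10000):
    rate = expected_doublet_rate(n)
    print(f"{n} cells: ~{rate:.1%} doublets, ~{expected_doublets(n):.0f} events")
```

Under this assumption the absolute number of doublets grows with the square of the cell number, so larger experiments deserve a closer look at doublet detection.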

The sequencing per cell is very much dependent on what your question is. 10X recommend 50,000 reads per cell, which should detect 1200 transcripts in BMCs, or 6000 in HEK293 cells. It is not completely clear how much additional depth will increase genes detected before you reach saturation, but it is not worth going much past 150,000 reads per cell.
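The per-cell depths above translate into a sequencing budget by simple multiplication. The sketch below uses the 50,000 and 150,000 reads-per-cell figures from the text; the cell number of 3000 is purely illustrative and the function name is my own.

```python
# Sketch only: total reads needed for a given number of cells and depth.
RECOMMENDED_READS_PER_CELL = 50_000   # 10X recommendation quoted above
PRACTICAL_MAX_READS_PER_CELL = 150_000  # depth beyond which returns look small


def total_reads(n_cells: int, reads_per_cell: int = RECOMMENDED_READS_PER_CELL) -> int:
    """Return the total read count needed to hit reads_per_cell on n_cells."""
    if n_cells < 0 or reads_per_cell < 0:
        raise ValueError("inputs must be non-negative")
    return n_cells * reads_per_cell


n_cells = 3000  # hypothetical experiment size
for depth in (RECOMMENDED_READS_PER_CELL, PRACTICAL_MAX_READS_PER_CELL):
    print(f"{n_cells} cells at {depth:,} reads/cell = {total_reads(n_cells, depth):,} reads")
```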

1 million single-cells: 10X also presented a 3D tSNE plot of the recently released 1 million cell experiment. This was an analysis of E18 mouse cortex, hippocampus, and ventricular zone. The 1 million single-cells were processed as 136 libraries across 17 Chromium chips, and 4 HiSeq 4000 flowcells. This work was completed by one person in one week - it is amazing to think how quickly single-cell experiments have grown from 100s to 1000s of cells, and become so simple to do.

Additional sequencing is underway to reach ~20,000 reads per cell. All raw and processed data will be released without restrictions.
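The scale of the 1 million cell experiment is easier to appreciate with the arithmetic written out. This sketch uses only the numbers quoted above (1 million cells, 136 libraries, 17 chips, ~20,000 reads per cell); the averages are simple divisions and say nothing about how evenly cells were actually spread across libraries.

```python
# Sketch only: back-of-envelope breakdown of the 1 million cell experiment.
TOTAL_CELLS = 1_000_000
LIBRARIES = 136
CHIPS = 17
TARGET_READS_PER_CELL = 20_000  # approximate target quoted above

libraries_per_chip = LIBRARIES // CHIPS
mean_cells_per_library = TOTAL_CELLS / LIBRARIES
total_target_reads = TOTAL_CELLS * TARGET_READS_PER_CELL

print(f"Libraries per chip: {libraries_per_chip}")
print(f"Mean cells per library: {mean_cells_per_library:.0f}")
print(f"Total reads at target depth: {total_target_reads:,}")
```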

The number of cells required to detect a population is still something that people are working on. The 1 million cell dataset is probably going to help the community by delivering a rich dataset that users can analyse and test new computational methods on.

What's next from 10X: A new assay coming in Spring 2017 is for Single Cell V(D)J sequencing, enabling high-definition immune cell profiling.

The seminar was well attended showing how much interest there is in single-cell methods. Questions during and after the seminar included the costs of running single-cell experiments, the use of spike-ins (e.g. ERCC, SIRV, Sequins), working with nuclei, etc.

In answering the question about working with nuclei 10X said "we tried and it is quite difficult"...the main difficulty was the lysis of single-nuclei in the gel droplets. Whilst we might not be able to get it at single-cell resolution, this difficulty in lysing the nucleus rather than the cell might possibly be a way to measure and compare nuclear versus cytoplasmic transcripts.

Thursday 17 November 2016

MinION: 500kb reads and counting

A couple of Tweets today point to the amazing lengths Oxford Nanopore's MinION sequencer is capable of generating - over 400kb!

Dominik Handler Tweeted a plot showing read distribution from a run. In replies following the Tweet he describes the DNA handling as involving "no tricks, just very careful DNA isolation and no, really no pipetting (ok 2x pipetting required)".

and Martin Smith Tweeted an even longer read, almost 500kb in length...

Exactly how easily we'll all see similar read lengths is unclear, but it is going to be hugely dependent on the sample and probably having "green fingers" as well.

Here's Dominik's gel...

Wednesday 9 November 2016

Unintended consequences of NGS-based NIPT?

The UK recently approved an NIPT test to screen high risk pregnancies for foetal trisomy 21, 13, or 18 after the current primary screening test, and in place of amniocentesis (following on from the results of the RAPID study). I am 100% in favour of this kind of testing and 100% in favour of individuals, or couples, making the choice of what to do with the results. But what are the consequences of this kind of testing and where do we go in a world where cfDNA foetal genomes are possible?

I decided to write this post after watching "A world Without Downs", a documentary on BBC2 that was presented by Sally Phillips (of Bridget Jones fame), mother to Olly who has Down's syndrome. She presented a programme where the case for the test was made (just), but the programme was very clearly pro-Down's, although not quite to the point of being anti-choice.

Does the world have too many HiSeq X Tens?

Illumina stock dropped 25% after a hammering by the stock market following their recent announcement that Q3 revenues would be 3.4% lower than expected at just $607 million. This makes Illumina a much more attractive acquisition (although I doubt this summer's rumours of a Thermo bid had any substance), and also makes a lot of people ask the question "why?"

The reasons given for the shortfall were "a larger than anticipated year-over-year decline in high-throughput sequencing instruments" i.e. Illumina sold fewer sequencers than it expected to. It is difficult to turn these revenue figures and statements into the number of HiSeq 2500s, 4000s or Xs that Illumina missed its internal forecasts by, but according to Francis de Souza Illumina "closed one less X deal than anticipated" - although he did not say if this was an X5, X10 or X30! Perhaps more telling was that de Souza was quoted saying that "[Illumina was not counting on a continuing increase in new sequencer sales]"...so is the market full to bursting?

Controlling for bisulfite conversion efficiency with a 1% Lambda spike-in

The use of DNA methylation analysis by NGS has become a standard tool in many labs. In a project design discussion we had today somebody mentioned the use of a control for bisulfite conversion efficiency that I'd missed. As it's such a simple one I thought I'd briefly mention it here. In their PLoS Genet 2013 paper, Shirane et al from Kyushu University spiked-in unmethylated lambda phage DNA (Promega) to control for, and check, that the C/T conversion rate was greater than 99%.

The bisulfite conversion of cytosine bases to uracils, by deamination of unmethylated cytosine (as shown above) is the gold standard for methylation analysis.
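The lambda spike-in check described above boils down to one ratio: because the lambda DNA is unmethylated, every cytosine in reads mapping to lambda should be read as T after conversion. The sketch below is my own minimal illustration of that calculation, not the method used by Shirane et al; the function names and the example counts are hypothetical, and it assumes you already have counts of converted (T) and unconverted (C) calls at cytosine positions in the lambda genome.

```python
# Sketch only: bisulfite conversion efficiency from an unmethylated spike-in.
# Inputs are counts of base calls at reference cytosine positions in lambda.


def conversion_efficiency(converted: int, unconverted: int) -> float:
    """Fraction of spike-in cytosines read as T (i.e. successfully converted)."""
    total = converted + unconverted
    if total == 0:
        raise ValueError("no cytosine positions covered in the spike-in")
    return converted / total


def passes_conversion_qc(converted: int, unconverted: int,
                         threshold: float = 0.99) -> bool:
    """True if the conversion rate is greater than the threshold (99% here)."""
    return conversion_efficiency(converted, unconverted) > threshold


# Hypothetical counts for illustration only
print(conversion_efficiency(995, 5))
print(passes_conversion_qc(995, 5))
print(passes_conversion_qc(980, 20))
```

Any unconverted C calls on lambda are a mix of incomplete conversion and sequencing error, so the ratio is best read as a lower bound on the true conversion rate.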

SIRVs: RNA-seq controls from @Lexogen

This article was commissioned by Lexogen GmbH.

My lab has been performing RNA-seq for many years, and is currently building new services around single-cell RNA-seq. Fluidigm’s C1, academic efforts such as Drop-seq and inDrop, and commercial platforms from 10X Genomics, Dolomite Bio, Wafergen, Illumina/BioRad, RainDance and others make establishing the technology in your lab relatively simple. However the data being generated can be difficult to analyse and so we’ve been looking carefully at the controls we use, or should be using, for single-cell, and standard, RNA-seq experiments. The three platforms I’m considering are the Lexogen SIRVs (Spike-In RNA Variants), SEQUINs, and ERCC 2.0 (External RNA Controls Consortium) controls. All are based on synthetically produced RNAs that aim to mimic complexities of the transcriptome: Lexogen’s SIRVs are the only controls that are currently available commercially; ERCC 2.0 is a developing standard (Lexogen is one of the groups contributing to the discussion), and SEQUINs for RNA and DNA were only recently published in Nature Methods.

You can win a free lane of HiSeq 2500 sequencing of your own RNA-seq libraries (with SIRVs of course) by applying for the Lexogen Research Award.

Lexogen’s SIRVs are probably the most complex controls available on the market today as they are designed to assess alternative splicing, alternative transcription start and end sites, overlapping genes, and antisense transcription. They consist of seven artificial genes in-vitro transcribed as multiple (6-18) isoforms to generate a total of 69 transcripts. Each has a 5’triphosphate and a 30nt poly(A)-tail, enabling both mRNA-Seq and TotalRNA-seq methods. Transcripts vary from 191 to 2528nt long and have variable (30-50%) GC-content.

Want to know more: Lexogen are hosting a webinar to describe SIRVs in more detail on October 19th: Controlling RNA-seq experiments using spike-in RNA variants. They have also uploaded a manuscript to BioRxiv that describes the evaluation of SIRVs and provides links to the underlying RNA-Seq data. If you are a bioinformatician you might want to download this data set and evaluate the SIRV reads yourself. Or read about how SIRVs are being used in single-cell RNA-seq in the latest paper from Sarah Teichmann’s group at EBI/Sanger.

Before diving into a more in-depth description of the Lexogen SIRVs, and how we might be using them in our standard and/or single-cell RNA-seq studies, I thought I’d start with a bit of a historical overview of how RNA controls came about...and that means going back to the days when microarrays were the tool of choice and NGS had yet to be invented!

Batch effects in scRNA-seq: to E or not to E(RCC spike-in)

At the recent Wellcome Trust conference on Single Cell Genomics (Twitter #scgen16) there was a great talk (her slides are online) from Stephanie Hicks in the @irrizarry group (Department of Biostatistics and Computational Biology at Dana-Farber Cancer Institute). Stephanie was talking about the recent work she's been doing looking at batch effects in single-cell data, all of which you can read about in her paper on the BioRxiv: On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. You can also read about this paper over at NextGenSeek.

Adapted from Figure 1 in Hicks et al.

Clinical trials using ctDNA

DeciBio have a great interactive Tableau dashboard which you can use to browse and filter their analysis of 97 “laboratory biomarker analysis” immuno-oncology clinical trials; see: Diagnostic Biomarkers for Cancer Immunotherapy – Moving Beyond PD-L1. The raw data comes from ClinicalTrials.gov where you can specify a "ctDNA" search and get back 50 trials, 40 of which are open.

Two of these trials are happening in the UK. Investigators at The Royal Marsden are looking to measure the presence or absence of ctDNA post CRT in EMVI-positive rectal cancer. And AstraZeneca are looking for ctDNA as a secondary outcome to obtain a preliminary assessment of safety and efficacy of AZD0156 and its activity in tumours by evaluation of the total amount of ctDNA.

You can also specify your own search terms and get back lists of trials from OpenTrials, which went live very recently. The Marsden's ctDNA trial above is currently listed.

You can use the DeciBio dashboard on their site. In the example below I filtered for trials using ctDNA analysis and came up with 7 results:

Thanks to DeciBio's Andrew Aijian for the analysis, dashboard and commentary. And to OpenTrials for making this kind of data open and accessible.

Friday 7 October 2016

Index mis-assignment to Illumina's PhiX control

Multiplexing is the default option for most of the work being carried out in my lab, and it is one of the reasons Illumina has been so successful. Rather than the one-sample-per-lane we used to run when a GA1 generated only a few million reads per lane, we can now run a 24 sample RNA-seq experiment in one HiSeq 4000 lane and expect to get back 10-20M reads per sample. For almost anything other than genomes multiplexed sequencing is the norm.
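The multiplexing numbers above imply a lane yield that is worth writing down. The sketch below uses only the 24 samples and 10-20M reads per sample quoted in the paragraph and assumes reads are shared evenly between samples, which real pools rarely achieve; the function name is my own.

```python
# Sketch only: lane yield implied by a multiplexed RNA-seq pool,
# assuming reads are split evenly between samples.


def lane_reads_needed(n_samples: int, reads_per_sample: int) -> int:
    """Total reads a lane must deliver to give each sample reads_per_sample."""
    if n_samples <= 0 or reads_per_sample < 0:
        raise ValueError("need at least one sample and non-negative reads")
    return n_samples * reads_per_sample


N_SAMPLES = 24
low = lane_reads_needed(N_SAMPLES, 10_000_000)
high = lane_reads_needed(N_SAMPLES, 20_000_000)
print(f"{N_SAMPLES} samples need {low:,} to {high:,} reads from the lane")
```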

But index sequencing can go wrong, and this can and does happen even before anything gets on the sequencer. We noticed that PhiX has been turning up in demultiplexed sample Fastq files. PhiX does not carry a sample index so something is going wrong! What's happening? Is this a problem for indexing and multiplexing in general on NGS platforms? These are the questions I have recently been digging into after our move from HiSeq 2500 to HiSeq 4000. In this post I'll describe what we've seen with mis-assignment of sample indexes to PhiX. And I'll review some of the literature that clearly pointed out the issue - in particular I'll refer to Jeff Hussmann's PhD thesis from 2015.

The problem of index mis-assignment to PhiX can be safely ignored, or easily fixed (so you could stop reading now). But understanding it has made me realise that index mis-assignment between samples is an issue we do not know enough about - and that the tools we're using may not be quite up to the job (but I'll not cover this in depth in this post).