Monday, 21 July 2014

1st Altmetric conference - Sept 25/26th in London

I've been a user of Altmetric for a while now and very much like what they are doing with article metrics. I'm sure many Core Genomics readers will also have seen the Altmetric badge on their own papers. Now Altmetric are hosting their first conference.


The meeting aims to demonstrate how users are integrating Altmetric tools into their processes. Hopefully they'll cover lots of interesting topics and spend some time talking about how the community can keep tools like Altmetric from becoming devalued by gaming.

Might see you there...

Thursday, 17 July 2014

A hint at the genome's impact on our social lives

GWAS is still in the news and still finding hits: the number of GWAS hits has increased rapidly since the first publication, for AMD, in 2005. Watch the movie to see the last decade of work!




A recent paper in PNAS seems to have got people talking: in Friendship and natural selection, Nicholas Christakis and James Fowler describe their analysis of the Framingham Heart Study (FHS) data, specifically the data of people recorded as friends by participants. The FHS recorded lots of information about relatives (parents, spouses, siblings, children), but also asked participants “please tell us the name of a close friend”. Some of those friends were also participants, and it is these data the paper used to determine a kinship coefficient; higher values indicate that two individuals share a greater number of genotypes (homophily).

The study has generated a lot of interest and news (GenomeWeb, BBC, Altmetric) but also some negative comments, mainly about how difficult this is to prove in a study where you cannot rule out genetic relationships individuals themselves don't know exist (i.e. I don't know who my third cousins are and might make friends by chance).

The data in the supplemental files from the PNAS paper show a Manhattan plot (top) for the identified loci; it's not as stunning an example as you'd see in other fields. Compare it to a well-characterised GWAS hit from a replicated study in ovarian cancer (bottom).








Monday, 14 July 2014

Sequencing exomes: what sort of read to use?

What's the right way to sequence an exome? We've been looking at Illumina's v4 chemistry for HiSeq 2500 and wondering whether we should jump to PE125bp or not, or whether we should try to reconfigure our exome capture for shorter or longer fragments.

Exome-seq: Exomes have been a big hit; there are currently over 3000 publications in PubMed with the search term "exome". Given that the first in-solution exome paper was only published in 2009 that's pretty amazing, but then again the exome is an amazing research tool.

Note to readers: This post started out as me writing down my thoughts about whether we should move to longer reads for exomes. But it has become a bit more rambling as I started to find out I need to do some more digging. I may well come back to this post with an update or version two...

There are many ways to prepare an exome for sequencing and in my lab we're currently using Illumina's rapid exome kit. We're also about to compare this to Agilent's new SureSelect QXT kit, which is a direct competitor to Illumina's Nextera-based offering. We've never tried Nimblegen or AmpliSeq. However, this post is more about how to sequence the exome than how to prepare it, so enough of kit comparisons.

The standard exome: There are two things you need to consider when sequencing exomes: read depth and read-length. I'm not going to worry about depth in this post, and instead I'm going to focus on read-length. Today most labs appear to be running exomes at PE75bp, a standard which I am not sure has ever been agreed by anyone, but it has been accepted as being good enough for most projects (Illumina recommend PE75-100). I know of some groups that moved over to PE100 to simplify lab logistics as much as anything else, but I am not clear that there are significant benefits to increasing length, so we've stuck at PE75 for the time being.

Are longer reads better: With the advent of v4 chemistry on HiSeq 2500 we should be able to generate high-quality paired-end 125bp reads, albeit with a slightly higher error rate at the end of the read. At first glance this additional data seems too good to ignore, especially when Illumina do not sell a 150-cycle SBS kit, and 3 x 50-cycle SBS kits would not be that much cheaper (and more hassle for my lab staff!). By my reckoning PE75 costs £900 per lane whilst PE125 is £1200, or £300 for an extra 100bp of sequence per fragment. So if cost does not prevent us using PE125, should we simply switch?
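To put the cost comparison on a per-base footing, here is a minimal sketch using only the two lane prices quoted above. The variable and function names are my own for illustration, and the sums ignore any difference in cluster yield or quality between the two run types.

```python
# Lane prices quoted in the post (GBP) and sequenced bases per fragment
# (paired-end, so 2 x read length). Names here are illustrative only.
runs = {
    "PE75": {"price_gbp": 900, "bases_per_fragment": 2 * 75},
    "PE125": {"price_gbp": 1200, "bases_per_fragment": 2 * 125},
}


def cost_per_base(run):
    """Lane price divided by the bases read from each fragment."""
    return run["price_gbp"] / run["bases_per_fragment"]


per_base = {name: cost_per_base(run) for name, run in runs.items()}

# Marginal cost of the extra bases gained by moving from PE75 to PE125.
extra_cost = runs["PE125"]["price_gbp"] - runs["PE75"]["price_gbp"]
extra_bases = runs["PE125"]["bases_per_fragment"] - runs["PE75"]["bases_per_fragment"]
marginal_cost_per_base = extra_cost / extra_bases

for name, value in per_base.items():
    print(f"{name}: GBP {value:.2f} per sequenced base position per lane")
print(f"Extra {extra_bases}bp costs GBP {marginal_cost_per_base:.2f} per base position")
```

On these numbers the extra bases are cheaper than the first 150, which is why price alone does not settle the question.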

Insert size vs read-length: As you can see below, the average distribution of exome fragment sizes spans the read-length of the sequencer. The solid black line indicates 150bp (PE75): everything to the left of this will be fragments sequenced with overlapping reads (opes), whilst everything to the right is sequenced with non-overlapping reads (nopes). As read-length increases the percentage of fragments sequenced with an overlap also increases: at PE100 (dashed line) this is over 50% of fragments, and at PE125 (dotted line) it's about 75% of all fragments. An overlapping read pair creates some issues as the two reads are not independent, so tools need to take the overlap into account when calculating on-target coverage, etc.; but it also offers the opportunity to increase variant calling quality by increasing Q-scores in the overlap region.
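The relationship between fragment length, read-length and overlap can be sketched in a few lines. This is only an illustration: the fragment lengths below are made-up values, not the measured distribution from our libraries, and the function names are hypothetical. A read pair overlaps whenever the fragment is shorter than twice the read-length.

```python
def read_overlap(fragment_len, read_len):
    """Bases of the fragment covered by both reads of a pair (0 if none).

    If the fragment is shorter than a single read, both reads cover the
    whole fragment, so the overlap is the fragment length itself.
    """
    covered_by_each = min(read_len, fragment_len)
    return max(0, 2 * covered_by_each - fragment_len)


def fraction_overlapping(fragment_lengths, read_len):
    """Fraction of fragments whose two reads overlap by at least 1bp."""
    overlapping = sum(1 for f in fragment_lengths if read_overlap(f, read_len) > 0)
    return overlapping / len(fragment_lengths)


# Hypothetical fragment lengths (bp), chosen only to illustrate the trend.
example_fragments = [120, 140, 160, 180, 185, 200, 220, 240, 260, 300]

for read_len in (75, 100, 125):
    frac = fraction_overlapping(example_fragments, read_len)
    print(f"PE{read_len}: {frac:.0%} of example fragments have overlapping reads")
```

With a real library you would feed in the insert sizes estimated from the aligned read pairs rather than a hand-written list.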



Exome libraries may not be the best size for sequencing: If a non-overlapping read is the best kind to generate then we may need to reconfigure library prep in the light of v4 chemistry. An interesting comparison can be made to the Agilent Bioanalyser trace below the computed insert size distribution. If you overlay and rescale the two images, then the Agilent trace appears to be peak-shifted to larger fragments, and the right-hand fragment distribution is much broader. This appears to demonstrate the preference of clustering:sequencing for shorter fragments.

Exome libraries are probably the best size for capturing exons: The average exon length in the human genome is 170bp, with 80–85% of exons less than 200bp (Zhu et al & Sakharkar et al), so the 185bp average fragment length seems almost ideal.

http://iospress.metapress.com/content/x40k2kuge74488kf/fulltext.pdf
Table reproduced from Sakharkar et al 2004

So what's the sweet spot for exome capture and sequencing: The simple answer is I don't know, and several factors are likely to affect this. As we increase read-length we'll get more fragments with overlapping reads, which could be wasteful; the same happens if we decrease fragment size, so longer reads give us more and more overlap, albeit with higher quality in the overlapping region. But unless there are tools to make use of this, the data are redundant. So fragments should not be shorter than the reads used to sequence them.

But fragments are captured by probes of 95bp so we should probably not make fragments shorter than probes.

Exome capture kits contain blocking oligos to prevent adapter:adapter hybridisation and off-target pull-down. As fragment length increases, the amount of near-target sequence captured may increase, meaning we should not make fragments too long. A long fragment risks too much off-target enrichment by the secondary capture of off-target fragments.

Lastly (for now) we'd like to be able to use independent fragments for our analysis so read-pairs might be better replaced with longer single-reads, but twice as many. So perhaps the answer is probes that efficiently capture exons with little or no fragment:fragment hybridisation, coupled to single-end 185bp sequencing with low error-rate across the reads.

Monday, 7 July 2014

Anatomy of a NextSeq flowcell


Personally I'm thinking that the aluminium plate might make a pretty nifty bottle opener!

Thursday, 3 July 2014

How to find the best papers to read is tough

We've all been there: PubMed returns over 2500 RNA-seq papers, and there are still 800 left when you only search the title! How do you find the best papers to read? PubMed can help a little more with your quest to find out more about RNA-seq as there are just 19 reviews, but it's often primary papers you need to dig into to truly understand what's going on in a field. There are other ways to find out what's a hot paper and I've just started using a relatively new one: the Altmetric Explorer.

Before I go any further I will say this is a demo account (thanks Altmetric) and their pricing plans are squarely aimed at institutions. Hopefully they'll find a way to make tools for individuals with perhaps more limited search functionality.

What does Altmetric Explorer do: The search tool allows you to filter the vast amount of data Altmetric has collected; you can even enter a PubMed search directly. The first thing I did was to look at my own publication record and see who's talking about the papers I've co-authored. It turns out it is often just me (as far as Altmetric is concerned)!

I'd originally been in touch with the Altmetric team about using data from ORCID (I wrote about this last year) and seeing if it were possible to pull out more complex relationships between authors. The aim was to make creation of something like the Circos plot below easy to do for any group of individuals or even institutes. I'm still a long way from doing this but if anyone can offer some help that would be great!



The searches presented below simply use a PubMed search and list papers in order of most activity, as recorded by Altmetric. You can filter on lots of other metrics including keyword, date, journal, etc. Take a look and get in touch with the Altmetric team if you'd like to do more.
RNA-seq Altmetric activity:


ChIP-seq Altmetric activity:

My Altmetric activity:
PubMed = Hadfield J[author]

Tuesday, 1 July 2014

Single cell extravaganza

AGBT had lots of presentations on both clinical and single-cell work last February, and on Wednesday, February 12th Aviv Regev, of the Broad Institute, described her group's experience with the Fluidigm C1 system, and their early work in understanding cell-to-cell communication in the immune system. The results were published in Nature a couple of days ago. A more exciting story for me scientifically was published by Aviv Regev and Bradley Bernstein in Science, where they describe the use of single-cell sequencing to understand tumour heterogeneity.

The big headlines for me were that just 1M RNA-seq reads were enough to get high-quality gene expression estimates, and that single cells are great, but we’re going to need lots of them to delve deeply into biological systems. Both papers showed a massive loss of data at QC stages: around 1/3rd of cells were lost and only 30% of reads mapped to the transcriptome. Hopefully both of these are things we can improve in the next few years to make single-cell mRNA-seq even more powerful.

Monday, 23 June 2014

New sequencers from BGI: are they going to take market share from Illumina et al?

Everyone knows BGI bought Complete Genomics last year. What is less clear is what BGI's plans are for Complete's platform (a little more at the bottom of this thread). There has also been a bit of buzz about sequencers being developed by the genomics institute actually in Beijing. Two recent articles on Chemistry World and FierceBiotechIT discuss a newly developed sequencer coming out of the Beijing Institute of Genomics. It may even be in the hands of alpha testers now but my lab is not one of them - BIG: feel free to get in touch!

Thursday, 19 June 2014

V4some: 1TB here we come…

Our HiSeq 2500 v4 validation runs are just about to finish and I thought I’d share some details. Ideally I’d give you access to the runs so you can dig around yourselves but until Illumina makes this possible on a per lane basis in BaseSpace you’ll have to make do with my plots.