Wednesday, 23 April 2014

Nextera failures fixed by SPRI-cleanup

I'm continuing a series of posts highlighting small advances we've made in the lab that are unlikely to be published in their own right. This time it's the turn of SPRI-bead clean-ups, and how we used them to rescue some samples that would not tagment.

The problem: A user submitted a set of samples for Illumina's rapid exome sequencing in our library prep service. These samples were processed alongside others, but the tagmentation for this one user was pretty awful. Most samples dropped out entirely and only a few generated anything that looked as if it had been tagmented at all. The figure below shows tagmentation before SPRI clean-up: you can clearly see the control (inset) has tagmented, while nothing is visible in the second lane where the user's samples should have been. You might also notice that the control is significantly under-tagmented; we found that our control DNA was too high in concentration (15ng/ul instead of 5ng/ul), hence the under-tagmentation. Oops!

What was wrong: we tested the sensible things like nucleic acid quantity and quality, but there appeared to be enough DNA and its quality was fine. We asked the user to perform a standard PCR in their own lab and that came back positive, so we appeared to have something uniquely inhibiting tagmentation.

The solution: We decided that a SPRI-bead clean-up of the DNA before Nextera library prep was worth testing, and this proved to fix the samples. The figure below also shows tagmentation after SPRI clean-up, where you can clearly see that both the control (inset) and the user's sample have tagmented. These went on to produce exome sequencing libraries.

The protocol:
  1. Made all samples to 40 ul volume
  2. Performed Ampure XP bead clean-up using a 1:1 ratio, 40ul beads and 40ul of DNA, 80% ethanol washes, eluted in 25ul of water
  3. Calculated volumes for Tagmentation (aim for 5ng/ul)
  4. Quantified to check concentration
  5. Performed tagmentation and clean up and ran the Bioanalyser.
  6. Traces match what we should expect following Illumina’s protocol
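The dilution arithmetic in step 3 is just C1V1 = C2V2; here's a quick sketch (the 5 ng/ul target is from the protocol above, the function name and 10 ul final volume are mine):

```python
def dilution_volumes(stock_ng_ul, final_ng_ul=5.0, final_vol_ul=10.0):
    """Volumes of DNA and water needed to hit the tagmentation input (C1V1 = C2V2)."""
    if stock_ng_ul < final_ng_ul:
        raise ValueError("sample is already below the target concentration")
    dna_ul = final_ng_ul * final_vol_ul / stock_ng_ul
    return dna_ul, final_vol_ul - dna_ul

# e.g. a post-clean-up sample quantified at 12.5 ng/ul, diluted to 5 ng/ul in 10 ul
dna, water = dilution_volumes(12.5)  # (4.0, 6.0)
```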

Improving Nextera kits: We'd really like to see a bead normalisation step move to the front of all Nextera kits (possibly all kits). A clean-up before starting would rescue some poor quality samples, and the normalisation should make the process more robust. Even if the bead-based normalisation does not work very well (and it does not), it should give us much lower variation in concentration between samples going into library prep. A quick QT and all samples could then be normalised to the correct starting concentration much more easily.

Wednesday, 16 April 2014

More of a fizz than a bang: HiSeq 1TB is now available

Illumina launched the much-anticipated v4 kits for 1TB (PE125bp) sequencing on HiSeq 2500 today: we're about to jump from 600GB in ten days to 1TB in six - awesome and awe-inspiring!

Peter Barthmaier from Illumina walks you through the V4 kits in this video.

We've been waiting for this "upgrade" since an Illumina workshop presentation at AGBT in 2011. Recently there has been speculation that it would be released to customers in June, but the kits have become available ahead of any confirmation (for this lab at least) that our machine can make use of them. Be careful if you place an order today that you don't order something you can't use; although I'm sure the returns desk at Illumina's HQ in San Diego will be fine as long as you've got your receipt and everything's in the original packaging ;-).

The v4 kits are available alongside v3 so we can now work out how much the additional data is going to cost, and it's a whopping 35% more per flowcell. Ouch. This is the first upgrade where we've seen an increase in data coming with an almost commensurate increase in cost, not something Illumina customers are likely to be happy with. I don't remember the details of the price increases we saw moving from GAI to GAIIx or HiSeq 300GB to HiSeq 600Gb but I'm sure it was more favourable.

So what does v4 give us: Take a look at the specs on Illumina's website and you can see that v4 gives 33% more reads and 25% more bp than v3. For the application Illumina have uppermost in mind, whole human genomes or similar, this is going to be a big leg-up for users. For a lab like mine with lots of counting applications (e.g. ChIP-seq, RNA-seq) the per-read costs of v4 are virtually the same, but the per-lane cost is going up, and for many users a single lane is enough, yet they are going to pay more for essentially the same data. We could try to stock both v3 and v4 but that's just impractical given the unpredictability of usage.

The numbers add up to be pretty amazing: V4 gives 167% more data and 133% more reads than V3, but costs 138% more per run. I calculated that if you run the system flat out and squeeze as many runs as possible in a month then V4 gives 306% more data and 244% more reads than V3, but will cost 253% more to run due to the much faster run time and therefore more flowcells to run per month!
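The monthly figures fall out of a simple runs-per-month calculation. Reproducing them requires reading the "% more" figures as straight ratios x100, and assuming per-run specs of roughly 600Gb/3B reads in an 11-day v3 run versus 1Tb/4B reads in a 6-day v4 run (my assumptions for illustration, not numbers stated in the post):

```python
def monthly_output(per_run, run_days, days=30):
    """Output per month if the instrument runs flat out, back to back."""
    return per_run * (days / run_days)

# Assumed per-run specs (illustrative): v3 = 600 Gb / 3 B reads in 11 days,
# v4 = 1000 Gb / 4 B reads in 6 days.
data_ratio = monthly_output(1000, 6) / monthly_output(600, 11)  # ~3.06x data per month
read_ratio = monthly_output(4, 6) / monthly_output(3, 11)       # ~2.44x reads per month
per_run_data = 1000 / 600                                       # ~1.67x data per run
per_run_reads = 4 / 3                                           # ~1.33x reads per run
```

On these assumptions the monthly ratios come out at about 3.06x the data and 2.44x the reads, matching the "306%" and "244%" figures above when read as ratios.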

The death of rapid run mode: V4 also kills rapid run mode from my personal point of view. As long as you don't actually need a human genome in 27 hours (PE100 run time), waiting six days for a high-output run is going to be much cheaper. You can run 16 lanes on High-Output at PE100 in 5 days versus 16 lanes on Rapid-Run in essentially the same time, but get 4B reads instead of 2.4B. That's also a lot easier for people in the lab, with just one clustering/sequencing cycle per week instead of four.

What's missing: Today there is no 150 cycle kit (for exomes) and no single-read flowcells, so ChIP- and RNA-seq users have to wait. I can't believe these will be far behind, so I'm not sure why a more coordinated release could not have happened. I'm certainly interested to see how much the single-read flowcells will cost; can we get RNA-seq sequencing costs to under $50 per sample (assuming 20M reads each for DGE)? The 2x250bp on Rapid Runs has also been pushed back to the Summer; nearly 300M "500bp" reads at >87% Q30 sounds worth waiting for!

In summary, a huge congratulations to everyone at Illumina who helped get the 1TB machine out the door; it kicks sand in the face of the 9-stone weaklings of the sequencing world.

But this is a departure from the past, where the cost of sequencing dropped at each update: if you want low cost per read rather than cost per GB, at "My Price" v3 is actually $0.13 cheaper per M reads; I hope this changes with the single-end kits. Of course all this means we'll be able to run many more samples in a year and buy a lot more of Illumina's razor blades (more expensive razor blades), and maybe watch that stock price climb over $180?

Monday, 14 April 2014

Turning cancer genomics into an app for your kids' phones

Cancer Research UK science has been turned into a game for your mobile: Genes in Space. The aim is to collect space rocks; to do this you need to map a route through the densest regions in space, then fly through and catch as many as possible. The great thing is that each time you play you're actually navigating through segmented breast-cancer copy-number data from the METABRIC project.

The video has a great "Frankie Goes to Hollywood" backing track and the unforgettable strapline "You don't have to wear a labcoat to help cure cancer sooner."

The project was conceived by Carlos Caldas's group here at the Cambridge Institute. In 2013 CRUK science geeks and external computer geeks got together for a weekend of game design. You can see the development in the video below.

Monday, 7 April 2014

Making ChIP-seq a little more robust

I had a fun time at Jason Carroll's group retreat in sunny Cromer a few weeks ago. I was invited to present some work we'd been doing on using Thruplex for ChIP-seq library prep and there was a lot of great science being discussed.
One of the presentations that stood out to me was from Kelly Holmes in Jay's group. She presented some simple steps all users can take to improve the quality of their ChIP-sequencing experiments: quantify your chromatin, check your sonication, standardise your library prep and check your libraries. She also summarised a presentation on the use of controls and normalisation spike-ins.

I thought what Kelly had suggested was so simple and likely to have such an impact that I asked if I could share her talk with Core Genomics users, here are her 4 steps to ChIP-seq heaven:

1 Quantify your Chromatin
  • Quantify using nanodrop or similar.
  • Use 2.5ul of each sample after sonication.
  • Standardise the amount of Chromatin in your ChIPs.

2 Check your sonication
  • Run sonicated chromatin on e-gel or similar.
  • Reverse crosslink samples after sonication and before running on gel.
  • Check that majority of chromatin is below 1Kb.

3 Standardise your library prep
  • Use a single protocol across your samples.
  • If size-selecting keep fragment length the same.
  • Don't over-amplify your samples, determine the number of PCR cycles needed to get enough library for sequencing.

4 Check your libraries 
  • Once you've made a ChIP-seq library, take some time over its QC and QT to get the very best results back from your sequencing lab.
  • QC and estimate library size using the Bioanalyser.
  • QT with qPCR using the estimated fragment size from above.
  • If libraries have different insert sizes then pooling can still be accurate, but fragment sizing is absolutely necessary

Once you've made libraries, pool your samples so they are equimolar. We recommend normalising all samples to the same molarity (10-20nM), then pooling equal volumes to create a final pool at 10-20nM ready to submit and sequence.
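The molarity and normalisation arithmetic above is straightforward, using ~660 g/mol per base pair of dsDNA (the function names and the worked numbers are mine, for illustration):

```python
def molarity_nM(conc_ng_ul, mean_fragment_bp):
    """Convert a library concentration (ng/ul) to nM, using ~660 g/mol per bp."""
    return conc_ng_ul / (660 * mean_fragment_bp) * 1e6

def dilution_to_target(conc_nM, target_nM=10.0, final_vol_ul=20.0):
    """Volumes of library and buffer needed to normalise to the target molarity."""
    lib_ul = target_nM * final_vol_ul / conc_nM
    return lib_ul, final_vol_ul - lib_ul

# e.g. 3.3 ng/ul with a 250bp mean fragment size from the Bioanalyser -> 20 nM,
# then diluted to 10 nM in 20 ul before equal-volume pooling
c = molarity_nM(3.3, 250)
lib, buffer = dilution_to_target(c)
```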

Controls and normalisation spike-ins: Kelly also summarised a recent Active Motif seminar at our Institute. Active Motif sell ChIP-seq kits and antibodies as well as controls such as their Ready-to-ChIP HeLa Chromatin and qPCR control kits. At their recent seminar on ChIP-seq they presented the use of Drosophila chromatin as a spike-in control for histone mark normalisation: 750ng of Drosophila chromatin in 30ug of Human chromatin, co-IP for a Drosophila-specific protein and the protein of interest, make ChIP-seq libraries and use the Drosophila reads to normalise peaks for the protein of interest between samples.
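The spike-in normalisation boils down to scaling each sample by its Drosophila read count; a minimal sketch of that idea, assuming the simplest scheme of scaling to the sample with the fewest spike-in reads (my reading of the approach, not Active Motif's code):

```python
def spike_in_factors(dros_reads):
    """Per-sample scaling factors from Drosophila spike-in read counts,
    relative to the sample with the fewest spike-in reads."""
    ref = min(dros_reads.values())
    return {s: ref / n for s, n in dros_reads.items()}

def normalise_peaks(peak_counts, factors):
    """Scale each sample's peak signal by its spike-in factor."""
    return {s: [c * factors[s] for c in counts] for s, counts in peak_counts.items()}

# Toy example: treated sample recovered 2x the spike-in reads, so its signal
# is scaled down 2x before comparing peaks between conditions.
dros = {"treated": 2_000_000, "untreated": 1_000_000}
factors = spike_in_factors(dros)
peaks = normalise_peaks({"treated": [100, 50], "untreated": [40, 30]}, factors)
```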

Tuesday, 1 April 2014

MiSeq X: is this Illumina's HiSeq 2500 replacement

Illumina are probably going to announce their latest sequencer today and Core Genomics heard about it before anyone else...MiSeq X. Similar in name to the HiSeq X Ten, the MiSeq X borrows many of the improvements seen in its big brother, including new optics, improved chemistry and a stand-alone server for data analysis allowing upload to BaseSpace. There are fewer restrictions on what we can do on MiSeq X, and the "only Human" restriction has gone.
  • You can buy a single MiSeq X unlike the minimum order of ten HiSeq X's
  • Runs are currently limited to fragment sizes of 350bp but users are not restricted to the TruSeq Nano DNA HT kit, any 350bp library is OK
  • Read length is limited to Paired-End 150bp in the first instance
  • Expect 150M reads per flowcell

What does this mean for HiSeq 2500: It's pretty cool to have a MiSeq that can generate as much data per lane as a HiSeq rapid run, and this creates an interesting problem for potential HiSeq owners. The new MiSeq X uses the same patterned flowcell, chemistry and software advances as the HiSeq X Ten, wrapped up in the form factor of the desktop MiSeq. With the 2x400bp coming by Christmas and assuming the MiSeq X uses all the space in its lane we could get up to 300-400M reads and well over 100Gb per run.

MiSeq X flowcell (courtesy of PowerPoint)

Who's buying MiSeq X: The usual suspects have their machines installed and running; an off-the-record quote from the Broad said "we got there first, we usually do", and the Sanger said "our history with Illumina means we're usually second, but that's so much better than being last". This is going to be the machine for every other lab outside of the Sanger, Broad and the New York Genome Centre (who essentially do everything the Broad does anyway) in their "big-boy" HiSeq X Ten club. Even the BGI are buying these "little-boys".

GenomeWeb contacted both Thermo Scientific (LifeTech) and Complete Genomics for their perspectives:
  • Thermo said "bugger, we're going to get our noses rubbed in those PGM vs MiSeq ads (1, 2, & 3)!"
  • Complete simply replied to their email asking for a response with "我们 投降", which broadly translates to "we surrender"!

Thursday, 27 March 2014

Illumina vs LifeTechnologies: the latest instrument comparison

A recent PLoS ONE paper is perhaps the latest battle in the Illumina vs LifeTech war. It presents a comparison of MiSeq and Proton PI sequencing to detect chromosome abnormalities in spontaneously aborted foetuses: Chen et al: Performance Comparison between Rapid Sequencing Platforms for Ultra-Low Coverage Sequencing Strategy.

The group were aiming to evaluate the use of ultra-low coverage sequencing (ULCS) to identify aneuploidy for potential use in the clinic. They compared data from 18 spontaneous abortions, making libraries for both MiSeq and Proton with 50ng of input DNA and the same 10 PCR cycles in both methods. The materials and methods descriptions of the library preps were refreshingly clear, and both data sets have been deposited in the SRA: MiSeq - SRA116521 & Proton - SRA116521. The team used 150bp paired-end sequencing and generated 4.58 million reads in 27 hours on MiSeq (215k reads per sample), and 111bp single-end sequencing generating 39.33 million reads in 4 hours on Proton (1.7M reads per sample) - almost 8x more data per sample by their calculation! In the analysis they randomly sub-sampled 90k reads from each sample.
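The random sub-sampling step is easy to sketch; in practice you would do this on FASTQ/BAM files with a tool like seqtk, but the logic is just sampling without replacement (the read counts below are illustrative):

```python
import random

def subsample_reads(read_ids, n, seed=42):
    """Randomly draw n reads without replacement, as in the paper's 90k sub-sampling.
    A fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(read_ids, n)

# e.g. a MiSeq sample with ~215k reads, sub-sampled to the 90k used in the analysis
reads = [f"read_{i}" for i in range(215_000)]
subset = subsample_reads(reads, 90_000)
```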

Comparing MiSeq and Proton: The group saw no significant difference between the two platforms when it came to detecting aneuploidy, or across the several QCs they investigated. The two figures below, reproduced from the paper, show aneuploidy as called by the two platforms (A) and relative sequence depth (RD) (B) across the chromosomes, averaged over all 18 samples. There was a slight difference when it came to calling the sex of the foetus: whilst both platforms showed 100% accuracy for sex determination, the MiSeq data was a little tighter. They put this down to the use of paired-end and slightly longer reads, but did not test this hypothesis by reanalysing the MiSeq data as single-end clipped to 111bp, which would have been pretty simple. There was also a significantly higher duplication rate in the Ion data, and although the extra reads meant this was not a problem for calling aneuploidy, the authors discussed several areas they'd like to see optimised; I'm not sure whether it's the library prep or the emulsion PCR that needs optimising. Another comparison paper by Quail et al improved library prep on Ion with the use of a different polymerase, but attempts to improve the bias seen in ePCR were unsuccessful. Lastly, the authors of the ULCS paper did not compare platform error rates; this could be justified as their analysis is alignment-based and can easily tolerate lower quality data, but error rate is likely to be strikingly important for users calling SNPs.
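The relative-depth approach boils down to: count the fraction of reads mapping to each chromosome, divide by the fraction expected from a euploid reference, and flag chromosomes whose ratio approaches 1.5 (trisomy) or 0.5 (monosomy). This is an illustrative re-implementation with made-up thresholds and reference fractions, not the paper's code:

```python
def relative_depth(sample_counts, reference_fractions):
    """Per-chromosome relative depth: observed read fraction / expected fraction."""
    total = sum(sample_counts.values())
    return {c: (n / total) / reference_fractions[c] for c, n in sample_counts.items()}

def call_aneuploidy(rd, gain=1.35, loss=0.65):
    """Flag chromosomes whose relative depth suggests trisomy (~1.5) or monosomy (~0.5)."""
    return {c: ("trisomy" if v >= gain else "monosomy" if v <= loss else "euploid")
            for c, v in rd.items()}

# Toy 3-chromosome genome with chr21 over-represented ~1.4x vs a euploid reference
ref = {"chr1": 0.50, "chr2": 0.30, "chr21": 0.20}
counts = {"chr1": 45_000, "chr2": 27_000, "chr21": 28_000}
calls = call_aneuploidy(relative_depth(counts, ref))
```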

Basic stats from the runs were presented in a table; however, table 2 appears to double-count the MiSeq's paired reads.

So what does this comparison tell us that's useful in Illumina vs LifeTech:  

Round 1 - speed: The two sequencers both run fast: although the paper describes the use of long-read MiSeq data, the system can complete a run in as little as 4-5 hours, and this is total run time including clustering from a prepared library. Proton is advertised as having fast run times and can complete a run in about 4 hours too; however, template prep needs to be done off the instrument and adds time and complexity after library preparation (about 8 hours total run time).

In the paper the authors are somewhat swayed by the apparently shorter TAT of Ion Proton, but I think they are mistaken once everything is taken into account. I've no experience of the Ion One-Touch or prep for Ion sequencing except for AmpliSeq, but from what I understand Illumina can certainly keep up with respect to time, especially on short reads. So I'd like to see future comparisons break down the time element of the different stages more carefully in their materials and methods sections. Both systems compete well on speed, so round 1's a draw.

Round 2 - read numbers: The major difference in this comparison is the number of reads generated, especially once the PII chips come out (finally, probably in 2014). LifeTech are promising 200M+ reads per chip; Dale Yuzuki has a great post comparing NextSeq to Proton PII if you want to hear more (disclosure: Dale works for LifeTech but is a very unbiased blogger). At these read depths Proton certainly looks like it can compete very well in a clinical market where turnaround time and cost per sample are so important. Going by Dale's numbers, Proton will generate 200M reads for $1000 ($500 per 100M reads) whilst NextSeq will give 400M for $1300 ($325 per 100M reads), making NextSeq about a third cheaper per read. Looks like Illumina win round 2. Edited after reading Dale's update!

Round 3 - cost per sample: The paper compares costs in $/Gb and $/sample, stating: MiSeq $150/Gb & $50/sample, and Proton $100/Gb & $15/sample - based on a yield of 17M reads & 5.1Gb from a $750 MiSeq run versus 80M reads & 10Gb from a $1000 Ion PI run, with 18 samples. At their stated requirement of 1M reads per sample MiSeq would remain unchanged, but Proton could complete 2, 3 or even 4 times as many samples. It looks like LifeTech win round 3, but MiSeq is giving a healthy 25M+ reads per run in our lab today, as high as 35M, so the price per sample is likely to be much closer to Proton's $15/sample.
Both systems compete well on cost per sample, so round 3's a draw.
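As a sanity check, the per-Gb and per-sample arithmetic is simple (run prices and yields as quoted in the paper; the assumption that each sample needs 1M raw reads is theirs, the function names are mine):

```python
def cost_per_gb(run_cost, yield_gb):
    """Run cost divided by total yield in Gb."""
    return run_cost / yield_gb

def cost_per_sample(run_cost, reads_millions, reads_per_sample_millions=1.0):
    """Run cost spread over the number of samples the read yield supports."""
    samples = int(reads_millions // reads_per_sample_millions)
    return run_cost / samples

miseq = (cost_per_gb(750, 5.1), cost_per_sample(750, 17))    # ~$147/Gb, ~$44/sample
proton = (cost_per_gb(1000, 10), cost_per_sample(1000, 80))  # $100/Gb, $12.50/sample
```

These come out close to the paper's rounded $150/$50 and $100/$15 figures.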

As the authors put it, the dramatic decrease in TAT and cost makes sequencing comparable or even superior to some existing approaches, such as comparative genomic hybridisation.

Other comparisons: In Quail et al: A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers, the authors neglected to include the number of reads as one of their comparison metrics, but they found key differences in data quality between the platforms. In Loman et al: Performance comparison of benchtop high-throughput sequencing platforms, the comparison showed that at the time MiSeq had the highest throughput per run and lowest error rates, whilst the Ion Torrent PGM had the highest throughput when run in 100bp mode but struggled with homopolymers.
Summary: Loman et al compared read numbers across platforms and I think this is a hugely important metric for many applications. Dale's blog focuses on the impact on transcriptome analysis, as users running DGE don't need many reads. So rather than the battle for exomes, which LifeTech seem to be aiming for, I think the real war will be over low-read-per-sample methods such as RNA-seq, CNV-seq, ChIP-seq, ATAC-seq, DNAse-seq, smRNA-seq, Repli-seq, etc - plenty of application space to fight over, and at a low cost-per-sample this means lots of sample-prep sales are possible too.
Will I be buying a Proton? Almost certainly not for exomes and genomes, but for everything else the jury's out. And I'll certainly be taking a closer look once the PII chip lands and delivers on the promise.

Wednesday, 26 March 2014

3TC-seq: differential expression from degraded RNA

Joakim Lundeberg at the Science for Life Laboratory, Stockholm, Sweden has published a nice paper in PLoS ONE on the impact of RNA degradation on the quality of RNA-seq and differential expression results: Sigurgeirsson, Emanuelsson & Lundeberg: Sequencing Degraded RNA Addressed by 3' Tag Counting. PLOS ONE 2014. They used high-quality cell line RNA degraded to specific RINs by metal hydrolysis to demonstrate that "RIN has systematic effects on gene coverage, false positives in differential expression and the quantification of duplicate reads", and they provide a computational method for low-RIN DGE analysis that, most importantly, keeps false positives low. Whilst they demonstrate pretty good sensitivity, most users are likely to be affected more by false positives in low-quality RNA experiments; they will almost certainly come to the experiment understanding there will be limitations, and I find sensitivity is usually traded for specificity.

Why does RIN vary: anyone who's extracted RNA has probably run an Agilent Bioanalyser, or a denaturing agarose gel if you're old enough or don't have a Bioanalyser close by. When you look at RNA on a gel there is often variation in the intensity of the band/smear; the 18S and 28S peaks usually dominate in high-quality RNA. Agilent use 9 specific regions of the electropherogram to calculate the RNA Integrity Number (RIN), and this has become the default method for RNA quality assessment. Agilent Technologies used to have an RNA Integrity Database, a free repository of Agilent 2100 Bioanalyzer runs which users could compare their own results against: validated examples from over 650 total RNA runs including human, mouse, rat and plants. Unfortunately it seems to have disappeared; can anyone point me back to it?

Whilst we do try to use samples with high RIN, there are often times when this is not possible. Low-quality RNA is fine in many applications as long as the user is aware of some caveats: a 3' bias if using oligo-dT priming, and the difficulty of comparing sample groups with different RINs.

The experiment: Anyone who's tried to fragment RNA to a defined RIN may well have struggled. RNA is very labile and can easily be turned into RIN 3 or lower, but getting a nice distribution can be tough. The group used NEBNext Magnesium RNA Fragmentation reagents and different conditions to achieve RINs of 2, 4, 6, 8 and 10. Figure 2 (below) shows how the 18S & 28S peaks decrease and small RNA products (fragmented RNA) increase as the RNA degrades. They also showed that a degradation temperature of 74°C gave a more linear change in RIN than the higher temperatures more commonly used.

They made libraries using TruSeq and sequenced with paired-end 100bp reads (I'd point them to an earlier post from this month! Although they did downsample to just 20M reads for the DGE analysis). They marked but did not remove duplicate reads, used HTSeq for counting reads and DESeq for differential expression analysis.

They counted tags against only a defined 3' length of each transcript, set at 1500, 1000, 500 and 200bp. Without this restriction the more degraded RNAs lose gene counts, because only their 3' ends are retained, compared with the full-length (or almost full-length) mRNAs in high-quality RNA. The length restriction reduces the number of genes labelled as expressed, i.e. it decreases sensitivity.
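The 3' tag-counting idea can be sketched simply: given a transcript's coordinates and strand, only count reads whose position falls within the 3'-most L bases. This is an illustrative re-implementation, not the authors' code; coordinates are 0-based, half-open:

```python
def three_prime_window(start, end, strand, length=1000):
    """Return the 3'-most `length` bases of a transcript as a (start, end) interval."""
    if strand == "+":
        return max(start, end - length), end
    return start, min(end, start + length)

def count_tags(read_starts, transcript, length=1000):
    """Count reads whose position falls inside the transcript's 3' window."""
    start, end, strand = transcript
    w_start, w_end = three_prime_window(start, end, strand, length)
    return sum(w_start <= pos < w_end for pos in read_starts)

# Degraded sample: reads pile up near the 3' end of a 3kb, + strand transcript;
# only the three reads within the last 1000bp are counted.
tx = (0, 3000, "+")
reads = [2950, 2800, 2100, 1500, 100]
n = count_tags(reads, tx, length=1000)
```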

Comparison of degraded RNA: The data show very clearly that comparing RIN 10 to RIN 8 results in substantial numbers of DGE calls. Of course we'd argue very strongly that users should never attempt this, especially if groups are confounded by RNA quality. However, using a variable defined transcript length (200-1500bp) in the tag counting, they demonstrate that these false-positive DGE calls can be almost entirely removed while maintaining specificity: e.g. RIN 10 vs 8 generated 4344 DEGs without length restriction, but DEGs drop to 10% of that figure when a 1500bp restriction is applied, and to just 2 at 200bp. Sensitivity remains high until the 200nt length restriction (sensitivity being the ability of the method to call genes as expressed; see their Methods for the definition).

Beware of ribosomal reduction in low-RIN samples: It should be obvious, but the group show that attempting to use ribosomal depletion methods like RiboMinus on degraded RNA is not generally a good idea. Because the ribosomal RNAs are degraded along with the mRNAs, only the rRNA fragments with homology to the depletion probes will be removed.

Lastly in their discussion the authors make a similar observation to mine: that "the majority of all archived RNA sequence data to date is derived from poly A selection", oligo-dT enrichment of mRNAs works, people understand it and it is a popular method.

Monday, 24 March 2014

5mC-PCR: preserving methylation status during polymerase chain reaction

Methylation analysis is hampered by the simple fact that PCR amplification removes methylation marks from native DNA. We came up with a simple idea: produce a thermostable DNA methyltransferase to preserve methylation status through PCR cycles, allowing amplification of DNA and simplified analysis. The first thing we did was approach some enzyme companies to see if anyone had something suitable on their books; they did not, but they did seem to think this was a good idea. So we designed a pretty simple experiment to test it. This involved going back to basics - to how PCR was performed before the adoption of Taq polymerase - adding Dnmt1 after each amplification cycle to copy methylation marks onto the daughter strands.

Demonstrating preservation of 5-mC: As a thermostable Dnmt1 is not commercially available, we decided to introduce an additional step to ensure the methylation status is maintained during PCR amplification. We used commercial human DNA methyltransferase (Dnmt1) and the methyl donor S-adenosyl-L-methionine (SAM) after every PCR cycle to copy the methylation marks from the template strand to the newly formed complementary strand. As Dnmt1 degrades at high temperature and SAM degrades at neutral pH and high temperatures, we needed to add fresh reagents (Dnmt1 and SAM) at 37°C as the last step of each cycle.

We designed a synthetic oligo 122bp long containing a single unmethylated/methylated CpG inserted in a methylation sensitive restriction enzyme site for ClaI.

We set up 50 µl PCR reactions using Phusion HF polymerase, 20 ng of template and the following cycling conditions: 30s at 98°C, then 6 cycles of 10s at 98°C, 10s at 59°C, 10s at 72°C and 20 min at 37°C. Immediately on cooling to 37°C, the program was paused (1 min) for the addition of fresh Dnmt1, BSA and SAM. For convenience/accuracy, BSA (final 100 µg/ml) and SAM (final 160 µM) were premixed with a small amount of Dnmt1 buffer (final 0.05X) so that 1 µl could be added in each cycle.

NEB define 1 unit of Dnmt1 as the amount of enzyme required to catalyse the transfer of 1 pmol of methyl group to poly dI.dC substrate in a total reaction volume of 25 μl in 30 minutes at 37°C. The following amounts were added in the initial experiment:

PCR products were purified using a Zymo Clean & Concentrator Kit, digested using ClaI at 37°C for 30min, purified again and run on a 2% agarose gel:

Results: The final gels show the multiple reactions we set up to demonstrate the plausibility of preserving 5mC during a few cycles of PCR.

  1. Cla1 should have cut this template, but there are cut and uncut bands (incomplete digest?).
  2. Cla1 should have cut this template (the methyl mark was not amplified), but there are cut and uncut bands (probably from the methylated template - success?)
  3. Cla1 should have cut this template, but there is a significant uncut band (de novo activity?)
  4. Cla1 should not have cut this template and there is only an uncut band (success).
  5. Cla1 should not have cut this template (success).
Conclusions: Dnmt-PCR works (lane 3 vs. 4)! It turns out that Dnmt1 may also have some de novo activity, not only the widely accepted maintenance activity, so optimisation of the Dnmt1 incubation time/amounts is needed.

According to the literature, the maintenance vs de novo activity of Dnmt1 differs by 1-2 orders of magnitude, so hopefully we can find optimum conditions. For example, this paper shows that hDnmt1 has about 10% de novo activity, and that its Zn-binding N-terminal domain is responsible for preventing de novo methylation:

Bestor, T. H. Activation of mammalian DNA methyltransferase by cleavage of a Zn binding regulatory domain. EMBO J., 11: 2611–2617, 1992.

Unfortunately this is one of those projects that ran out of steam in our labs. Rather than leaving it to languish in a lab book, I thought I'd write it up here; who knows, maybe someone else can push it forward.

Thanks very much to Martin Bachman the PhD student who did most of the work, and Santiago Uribe Lewis the post-doc who thought my idea was a useful enough one for his research on imprinting to take a risk on the project.

Friday, 21 March 2014

Some help with your stats

Stats: not everyone's favourite subject, but something we can't avoid, so understanding the basics is a very good idea. We're lucky in my Institute to have biostatistical support in our Bioinformatics core facility, and we try to have a statistician with us every time we design a genomics experiment. The same questions come up time and time again - how many samples, and how deep to sequence? We're slowly getting answers, and the experience we're building up helps nearly every time. I also find other sources of information can be really helpful, and have listed a couple of them below.

Books about stats: You can buy the very useful Lab Math by Dany Spencer Adams, published by Cold Spring Harbor Laboratory Press and available from just £32.69 on Amazon. The book covers the most common mathematical tools you might apply in molecular biology; anyone making up reagents, performing simple statistical tests or working with nucleic acids and proteins is likely to benefit from a quick read through this book.

Stats from Nature Methods: You can now get all 35 Points of View columns in one place thanks to Nature Methods and the Methagora blog. I still feel these could be collected together in a single document as an eBook. I've always liked the format PoV took - short, focused articles that gradually introduce the important concepts in presenting data - and I wrote about the series in the Summer of last year.

Now Nature Methods have gone for a similar format but with a focus on stats in the Points of Significance column, which puts statistics in the limelight. Let's face it, there's little to be gained from a beautiful or carefully constructed visualisation if the data underneath are crippled by poor statistical analysis.

Stats from BiteSizeBio: a great series of articles by BiteSizeBio author Laura Fulford can be read together as a good stats introduction - Let’s Talk About Stats: Understanding the Lingo, Comparing Two Sets of Data, Comparing Multiple Datasets, and Getting the Most out of your Multiple Datasets with Post-hoc Testing.

Laura covers the language used and makes the very important point that you need to understand it to be able to talk to statisticians - but don’t forget you need to explain your language to them too: RNA-seq, exomes, read-depth and single- vs paired-end are all likely to be a mystery to most statisticians. I particularly liked her coverage of sample size (n), variance and false positives & negatives. Laura’s advice is simple: “the larger your sample size, the better… [and] as an absolute minimum you need an n of 3 to perform a statistical test”. Variance is important to understand: if you have high variance within your test and control groups, then making comparisons between them is going to be more difficult; you have “noisy” data. Be careful to check your data are normally distributed; if not, your statistical test might not be appropriate. Lastly, Laura makes the point that statistical tests are not perfect - they generate errors, and the two most people need to watch out for are false positives (type 1 errors), where a result looks statistically significant but is not, and false negatives (type 2 errors), where a significant result is missed.

In the second piece Laura provides a simple diagram to help you choose your statistical test. She describes the commonly used t-test and Mann-Whitney test for finding differences between two groups of samples, i.e. A vs B, tumour vs normal, treated vs untreated. If using a t-test, make sure your data are continuous, have a normal (or nearly normal) distribution, and have equal variance between sample groups. The Mann-Whitney test is used for unpaired samples and does not care how your data are distributed (normal or otherwise), or what the variance is; it is a non-parametric test.
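That two-group choice can be sketched in a few lines with scipy; the group names and values below are invented purely for illustration:

```python
# Comparing two groups: parametric t-test vs non-parametric Mann-Whitney.
# The data here are made-up expression values for illustration only.
from scipy import stats

tumour = [5.1, 6.0, 5.8, 6.3, 5.5, 6.1]
normal = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]

# t-test: assumes continuous, roughly normal data with equal variance.
t_stat, t_p = stats.ttest_ind(tumour, normal)

# Mann-Whitney U: makes no assumption about the distribution or variance.
u_stat, u_p = stats.mannwhitneyu(tumour, normal, alternative="two-sided")

print(f"t-test p = {t_p:.4f}, Mann-Whitney p = {u_p:.4f}")
```

With well-separated groups like these, both tests agree; they diverge when the normality or equal-variance assumptions of the t-test are violated.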

In the last article Laura covers statistical tests suitable for comparing more than two datasets. Again the choice of test depends on the design of your experiment - but of course you’ll have included a discussion with a statistician in the design process before generating any data. For experiments with a single variable, one-way ANOVA might be appropriate, e.g. treated vs untreated for two drugs. Experiments with more than one variable require different tests, such as two-way ANOVA, e.g. treated vs untreated for two drugs in male and female mice.
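A minimal one-way ANOVA along those lines, again using scipy (the groups and values are made up for illustration; a two-way ANOVA needs a heavier tool such as statsmodels):

```python
# One-way ANOVA: is there any difference among three treatment groups?
# Values are invented for illustration only.
from scipy import stats

untreated = [10.1, 9.8, 10.3, 9.9]
drug_a = [12.5, 12.9, 13.1, 12.4]
drug_b = [10.0, 10.2, 9.7, 10.4]

f_stat, p_value = stats.f_oneway(untreated, drug_a, drug_b)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Note that a significant p-value here only says that at least one group differs; it does not say which one, which is where post-hoc testing comes in.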

Unfortunately these tests for more complex experiments only tell you that there is a statistically significant difference somewhere, not where it lies; for that you need to do some post-hoc testing. You also need to consider multiple-testing correction, especially when applying statistical tests to data like exomes and RNA-seq; without it, a p-value cutoff of 0.05 is going to throw up a lot of false positives.
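Benjamini-Hochberg FDR correction is one common way to handle that multiple-testing problem at RNA-seq or exome scale; here is a minimal pure-Python sketch (the p-values are invented for illustration):

```python
# Benjamini-Hochberg false discovery rate correction.
# Adjusts a list of p-values so that a 0.05 cutoff controls the FDR
# rather than the per-test error rate.
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (q-values) in the original order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n - 1, -1, -1):
        i = order[rank]
        q = min(prev, pvals[i] * n / (rank + 1))
        adjusted[i] = q
        prev = q
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60, 0.74]
print(benjamini_hochberg(pvals))
```

Notice how the raw p-values just under 0.05 are pushed above it after correction: exactly the "false positives" that an uncorrected 0.05 cutoff would have let through.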

Hopefully some of this helps you next time you're deciding how many times to replicate your experiment and thinking about what the variance might be within and between sample groups.

Wednesday, 19 March 2014

Can RNA-seq stop Tour de France dopers?

The BBC ran an article a few weeks ago on the possibility of performance-enhancing genetics: think Team BMC Genomics! The piece includes an interview with Dr Philippe Moullier from INSERM in Nantes. He was part of a group that published a paper describing "neo-organ" gene therapy treatment of neuromuscular diseases by the introduction of the erythropoietin gene into mice.

For those of you that easily forget: EPO has a rather bad rap in cycling, just ask the UCI or Lance Armstrong!

Adding EPO to you, and detection with qPCR

Dr Moullier is part of another team that published a real-time PCR method to detect the EPO transgene in the presence of endogenous sequences: Longevity of rAAV vector and plasmid DNA in blood after intramuscular injection in nonhuman primates: implications for gene doping. Unfortunately, any cycling team with the millions of dollars needed to start a GM programme can probably design its way around such tests. In the same paper they showed that intramuscular (IM) injection of an EPO plasmid led to detectable levels of DNA in the blood and a "significant, but not life-threatening, increase in haematocrit". The DNA was rapidly eliminated, but plasmid genomes can persist for several months in WBCs. RO Snyder at the University of Florida, who led the work, has uploaded some slides from the 2013 Gene and Cell Doping Symposium in Beijing. So all this looks possible.

What's this got to do with RNA-seq? Why aim to detect a single gene product when athletes can find ways around the tests? Lance Armstrong was not overly sophisticated in his approach, and athletes have more to gain personally than the testing organisations, so they are probably quite motivated to put some effort into their doping.

Instead of a single-gene or metabolite test, why not try an experiment comparing groups of doping vs non-doping athletes and monitoring their blood-based gene expression levels over time? The aim would be to find a gene expression signature for doping in general, or at least for a specific class of doping - EPO vs steroids, for instance.

I'm sure I could help co-ordinate a UCI bid to the new Horizon 2020 EU funding for science: biotechnology or health perhaps, maybe even international cooperation! Anyone fancy putting together a grant for €20M and some shiny new Genesis or Brompton bikes?