Tuesday, May 25, 2010

"The Hottest molbio topics" Series: A scientific spectator's guide to next-generation sequencing

Our poll “The hottest molbio topics: the next few years” is now over, and I have invited well-known bloggers (and experts in the fields in question) to discuss and contextualize the results for our readers and to present them in a more engaging way [See Which will be the hottest topic in molecular biology in a few years? The results].

As you may remember, last week David García from You'd Prefer An Argonaute started off the “The hottest molbio topics” Series with his post entitled "The Allure of Regulatory RNAs", a great article on a field that has grown exponentially over the last few years.

This week, we've invited Dr Keith Robison to contribute to this Series.

Keith spent 10 years at Millennium Pharmaceuticals working with various genomics and proteomics technologies and 2 years at Codon Devices, working on a variety of protein and metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. He is currently at Infinity Pharmaceuticals.

Also, Keith blogs over at Omics! Omics!.

So, why did people vote for “New DNA sequencing technologies and postgenomics” in our poll? What makes this field so attractive? What are the basics of this fast-growing area of research?

This is what Keith has to say about it:

A scientific spectator's guide to next-generation sequencing

The Vancouver Winter Olympics are over but still fresh in mind. Somehow, due to a combination of work, power failures and American television executive myopia, I didn’t see any of my previous favorite event (long track speedskating) or the women’s figure skating long program, but I did find a new favorite race event (any cross-country ski event). The Winter Games come along only every 4 years, but for the most part the events don’t change, so it’s easy to pick up on them each time. However, in order to understand the judged events, it is important to understand how each performance is scored. If you don’t, you may find yourself announcing platinum medals to runners-up.

A crowded competitive arena more specific to biologists is the ongoing race to make obsolete whatever sequencing instruments are currently in use. These are also known as second generation and third generation instruments, though nobody can agree on the rules for declaring a system third generation. With several players in the market and new entrants leaping in, it can get confusing. On top of that, many of the established players have multiple instruments for sale. If you are trying to buy an instrument or purchase services, it’s important to understand the field. Most critically, no instrument is clearly “best” for all applications – each has its strengths and weaknesses.

So, what I’ll try to do here is give you a scorecard of how to grade the players. It’s not as systematic as the new figure skating rules, but there are a lot of angles to look at and I’ll try to touch on the key ones.

First, what are you trying to do? These instruments kick out huge numbers of sequences and have been applied to a wide variety of scientific problems: a complete census of all applications would run many pages and by the time you finished reading it, another application would have been published!
Nevertheless, there are a few major classes of applications. Genome sequencing can be roughly divided into de novo and resequencing, though in the middle are “reference-guided assemblies” (which refers to assembling a new genome using a related one as a rough guide). “Targeted resequencing” uses PCR or hybridization to defined sequences, in order to focus on small portions of a genome, such as coding exons (“exome”).

RNA-Seq, Digital Gene Expression (DGE) profiling and new versions of SAGE all attempt to look at the transcriptome and provide expression information with a sensitivity and precision superior to microarrays. RNA-Seq can also be used to identify new transcript isoforms or mutations in known transcriptomes or to work out novel transcriptomes.

Methyl-Seq reads out the methylation status of the genome. ChIP-Seq is a good example of functional genomics sequencing, in which some functional property (in this case transcription factor binding) is converted to short sequencing tags [See Analyzing the genome-wide chromatin landscape: ChIP-Seq]. There are easily a few dozen XXX-Seq flavors of functional genomics sequencing already. As we’ll see below, applications often change the point weighting for each attribute.

On sheer cost, there are three parameters to consider. First, there is the instrument cost – and not just the machine itself but all the required sample prep hardware and particularly some beefy computers to deal with all the data. Instruments tend to run from U$250K to U$1M, with easily U$100K-U$1M in additional accessories and computers. A trend lately has been for the established players to announce scaled-down versions of their top-end machines, though none of these seems likely to go below the U$250K mark. One reason I’m excited about the new Pacific Biosciences SMRT system is that it appears to be nearly complete out of the U$700K box (as you’ll see, I’m excited about all of the systems, making me the Scott Hamilton of sequencing), though there’s still a need for a compute farm to slurp the biology out of all that data. Another emerging system which is creating a lot of buzz around baseline cost is Ion Torrent’s, both because their technology does not require sophisticated optics (unlike all of the other systems described here) and because the instrument itself is proposed to sell for only U$50K. Ion Torrent is also promising that upgrading will no longer require major modifications (or replacement) of the instrument, but rather will all be contained in their consumables kit.

Cost can, however, also be looked at on three other somewhat intertwined metrics: cost per run, cost per sample and cost per nucleotide. Many of these instruments cost U$10K-U$20K in reagents per run. This delivers a ton of data, but limits the number of experiments you can do. The number of samples per run depends on your application. Sequencing a human-sized genome still requires many runs, but many other applications require only a slice of one run. Options for sample multiplexing with embedded sequence barcodes or subdivisions of the instrument surface (via gaskets or separate channels) enable reducing the cost per sample but do nothing for the cost per experiment – though you might find a friend to share with or be able through a service provider or core lab to share with strangers. Finally, there is the cost per base, generally expressed in a cost per human genome sequenced at approximately 40X coverage. To show one example of how these trade off, the new PacBio machine has a great cost per sample (~U$100) and per run (you can run just one sample) but a poor cost per human genome – you’d need around 12,000 of those runs to sequence a human genome (~U$120K). In contrast, one can buy a human genome on the open market for U$50K and sub U$10K genomes will probably be generally available this year.
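To see how these three metrics pull against each other, here is a minimal back-of-the-envelope sketch in Python. The function names are my own, and the per-run yield and the sample count in the usage lines are purely illustrative assumptions, not vendor figures; only the U$10K-U$20K reagent range and the 40X coverage target come from the discussion above.

```python
import math


def cost_per_sample(reagent_cost_per_run, samples_per_run):
    """Reagent cost attributed to each sample when one run is shared
    (for example via barcodes or separate channels)."""
    return reagent_cost_per_run / samples_per_run


def runs_for_coverage(genome_size_bases, coverage, bases_per_run):
    """Whole runs needed to reach the target fold coverage of a genome."""
    return math.ceil(genome_size_bases * coverage / bases_per_run)


def cost_per_genome(reagent_cost_per_run, genome_size_bases, coverage, bases_per_run):
    """Reagent cost of sequencing one genome to the target coverage."""
    runs = runs_for_coverage(genome_size_bases, coverage, bases_per_run)
    return reagent_cost_per_run * runs


# Illustrative inputs only (assumptions, not quotes from any vendor):
RUN_COST = 15_000                  # midpoint of the U$10K-U$20K reagent range
HUMAN_GENOME = 3_000_000_000       # roughly 3 Gb, rounded
ASSUMED_YIELD = 20_000_000_000     # hypothetical 20 Gb of sequence per run

print(cost_per_sample(RUN_COST, samples_per_run=8))
print(runs_for_coverage(HUMAN_GENOME, 40, ASSUMED_YIELD))
print(cost_per_genome(RUN_COST, HUMAN_GENOME, 40, ASSUMED_YIELD))
```

Notice that multiplexing changes only the first number: the run still costs the same, which is why cost per sample and cost per experiment can point in opposite directions.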

Different instruments have different end-to-end turnaround times because of their technologies. SMRT is again a huge winner, with end-to-end times under a day. 454, which was the first instrument of this class on the market, requires less than a day per run but the sample preparation can add a few days to that. On the other hand, instruments such as the ABI SOLiD and Illumina GA series may have runs lasting almost 2 weeks, plus more days for sample prep. For most research projects, you are not in such a hurry that this is a problem. On the other hand, if you are trying to monitor a new microbial epidemic or pick the right toxic chemotherapy for a cancer patient, only very short runtimes will work.

Sample prep is also a key issue – and one with which I don’t (yet) have hands-on experience. Each manufacturer and their academic fans will take shots at the other systems’ schemes. This is one attraction of some of the newest systems, such as PacBio’s, in which sample prep is supposedly very easy. There is also now an active community of companies selling alternative sample preparation methods.

Most of these instruments are characterized by very short read lengths in comparison to the 700-1000+ base reads commonly seen with Sanger technology. There has been improvement in getting longer reads, but many are quite short. 454 has demonstrated high quality reads to around 700 bases and low quality to 1000, and was the king in this regard. Illumina is generally available up to 90 bases and the top centers push that to 120 or 150, whereas ABI SOLiD tops out at 50. Complete Genomics uses really short reads to sequence human genomes, though there is a clever trick there. Polonator uses a similar trick, though not as extensively. Helicos is somewhere in SOLiD’s neighborhood. The new king, though, is PacBio, with reads routinely in excess of a kilobase and often 3-5Kb long. That crown, however, may not be worn long – Life Technologies has announced a technology which in theory has reads limited only by the length of the DNA template. Length is critical to genome sequencing and RNA-Seq experiments, but really short reads in huge numbers are what counts for DGE/SAGE and many of the functional tag sequencing methods. Technologies with really long reads tend not to give as many reads, and with all of them you can always choose a much shorter run so the machine can be turned over to another job sooner – if your application doesn’t need long reads.

To make up for the very short actual reads, there are tricks to getting multiple such reads from the same DNA fragment, a strategy generally known as paired reads and more specifically mate-pairs or paired-ends. As suggested by the word “pair”, to date these have involved getting two reads. Illumina actually reads both ends of the input molecule, reflected in nomenclatures such as 2x100. Generally these are nearly symmetric, but some clever folks have schemes which use a short read to tag a fragment and a long read from the other end to get the useful information. Systems such as the Polonator and Complete Genomics use a series of enzyme digestions to create multiple such tags – the sequencing equivalent of double-double and double-double-double jump combinations. There are losses, but in theory this paired tagging could go on for many cycles, unlike skaters who run out of precious momentum. Helicos (“dark fill”) and PacBio (“strobe sequencing”), which are both working with single molecules, can generate multiple tags from the same molecule. In addition to ameliorating the issue of short read lengths, all of these multiple tag sequences can provide critical linear information for resolving repeats in both de novo genome sequencing and resequencing, as well as more ability to ascertain the exon combinations in RNA-Seq.

But, that brings up accuracy. Keeping tabs on accuracy is particularly challenging since nobody talks about it unless they are proud of their numbers. SOLiD is pushing an error rate of 10^-6 (phred 60) and is now claiming that sample prep induced errors are starting to dominate actual sequencing errors. Other systems are worse, perhaps routinely delivering phred 20 or less. Getting lots of reads helps with consensus building; the random errors average out (any non-random errors might not, so this isn’t a panacea).
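For readers less used to phred scores, the conversion is Q = -10 x log10(error probability), so phred 60 is one error in a million bases and phred 20 is one in a hundred. Below is a small Python sketch of that conversion, plus a deliberately simplified model of why piling up reads helps. The function names are hypothetical, and the consensus model is an assumption on my part: it treats errors as independent between reads and, as a worst case, assumes every wrong read agrees on the same wrong base. It says nothing about the non-random errors mentioned above.

```python
import math


def phred_to_error(q):
    """Convert a phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert a per-base error probability to a phred quality score."""
    return -10 * math.log10(p)


def majority_vote_error(p, depth):
    """Probability that strictly more than half of `depth` reads are wrong
    at one position, under the simplifying assumptions that errors are
    independent and that all wrong reads agree with each other."""
    need = depth // 2 + 1  # wrong reads required to outvote the right ones
    return sum(
        math.comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(need, depth + 1)
    )


print(phred_to_error(60))   # per-base error probability at phred 60
print(error_to_phred(0.01))  # phred score for a 1-in-100 error rate

# How the consensus error shrinks with depth for phred 20 reads
for depth in (1, 3, 5, 9):
    print(depth, majority_vote_error(phred_to_error(20), depth))
```

Under those assumptions, even three phred 20 reads give a consensus error of roughly 3 in 10,000, which is why deep coverage can rescue a noisy platform as long as its errors really are random.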

There is also the question of access. If you don’t have the cash to buy an instrument, nearly all of the established systems can be accessed via a service provider. Some of these providers are for-profit institutions while others are university core labs. It pays to shop around, as there is a wide variety of pricing offered, though unfortunately there is not (yet?) a New York Sequencing Exchange to enable facile matching of scientists and providers. An interesting variant on this strategy is Complete Genomics: they perform human genome sequencing as an end-to-end service and do not sell their technology.

Unlike the Olympics, the sequencing game is constantly changing. Several of the systems mentioned above are not yet released, and there are many companies working on either evolutionary or revolutionary new sequencing instruments.

So, going back to the original question that drove this post: why is this all so exciting? As suggested above, it can be viewed as a gigantic scientific competition, but there are some other key reasons why progress in sequencing technology is capturing such attention. First, the pace of advance is dizzying. Rarely has a scientific field sustained such a technological rush for so long. Second, in the field of genomics these new sequencers are game-changing. What previously took dedicated science factories years to do can now be performed by small labs; with the availability of outsourced sequencing facilities you can truly sequence a genome from the comfort of your couch. A genome of great interest no longer needs to wait in line and be assigned a priority in a grueling competition. Finally and perhaps most importantly, these instruments are offering opportunities far beyond traditional genomics, even perhaps to perform experiments that have nothing to do with genomics. Our ability to explore many facets of living systems is being expanded by these advances, allowing us to contemplate experiments which before were pure fantasy. Crank your imagination up! The machines are no longer the bottleneck!





Anonymous said...

Very nice summary of the sequencing landscape, though might be worthwhile getting your take about Oxford nano ...
Also, one of the realities is that the long read technologies like PacBio and Visigen will generate significantly lower throughput than the short read technologies for the foreseeable future. For example, PacBio would generate 100 Mb or so per run compared to the 100 Gb or more on the short read platforms in 2010. So you would need 1000 runs on PacBio to equal 1 run on the HiSeqs and SOLiDs.

Kevin said...

ROFL facebook disallowed me to share this url as some users have flagged it as abusive.... like WAT?

Alejandro Montenegro-Montero said...

@Kevin that's weird... I shared it on FB with no problems

Unknown said...

Good article, thanks for sharing so much info. I think embedded sequence barcodes are a good invention.