HaplotypeCaller Multisample Variant Calling

GrantMarshall Posts: 6 Member

Hey there!

I've been using HaplotypeCaller as part of a new whole-genome variant calling pipeline I'm working on, and I have a question about the number of samples to use. From my tests, it seems that increasing the number of samples run through HaplotypeCaller simultaneously keeps improving accuracy (I've tried 1, 4, and 8 samples at a time). Before I try 16 samples, could you tell me whether there's a point of diminishing returns for adding samples to HaplotypeCaller? The time per sample seems to increase with every sample I add, so I don't want to keep adding samples if it won't improve the call set; if it does improve the results, I'll live with the longer run times. I should note that I'm building this pipeline for an experiment with up to 50 individuals, including family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would give the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

- Grant

Answers

  • GrantMarshall Posts: 6 Member

    Thanks a ton Geraldine,

    This was really helpful. I guess I'll have to experiment a bit more. I'm usually working with around 20x coverage so I was wondering if that 100 sample approximation was with similar coverage. If so, that should work out well for the short term and I look forward to what comes in 2.6!

  • GrantMarshall Posts: 6 Member

    Thank you again for your suggestions. For now it looks like I can just keep increasing sample counts for a while, but if I hit any hiccups I'll tweak those defaults :)

  • GrantMarshall Posts: 6 Member

    I've begun testing the rate of diminishing returns for my data and I have a question. How do you determine the quality of a call set produced by HaplotypeCaller? I've noticed that some figures (like the ones on this page) are just labeled "True positive rate" or "False positive rate", but it's not clear (at least to me) how you derived those values. I know of some QC metrics you can use, like Ti/Tv ratios, but I was wondering what you use at Broad to evaluate these tools so I know whether I'm heading in the right direction. Sorry to bother you again, and thanks for all of the help so far.
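
    For reference, the Ti/Tv ratio mentioned above is the number of transition SNPs (A<->G, C<->T) divided by the number of transversion SNPs. A minimal sketch, assuming the calls have already been reduced to (REF, ALT) pairs; the function names are hypothetical and this is not how any GATK tool computes it:

```python
from typing import Iterable, Optional, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True if the substitution stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return pair == PURINES or pair == PYRIMIDINES


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Return transitions / transversions, or None if there are no transversions.

    Anything that is not a single-base A/C/G/T substitution (indels,
    symbolic alleles, REF == ALT) is skipped.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


# Made-up example records: three transitions, one transversion, one indel.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
example_ratio = titv_ratio(example_snps)
```

    On its own this is only a sanity check on a call set; it does not say which individual calls are right or wrong.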

  • Geraldine_VdAuwera Posts: 5,235 Administrator, GSA Member

    Hi Grant,

    Call set quality evaluation is a complex topic. The basic way we calculate false vs. true positives is to compare calls to a database of highly curated calls which we use as "truth" data. Here, the selection of the truth data is key to the validity of the comparison, of course. We have some internal resources for this, as well as some public resources such as the datasets provided in our resource bundle. They are described (with an estimate of their reliability) in the FAQ article on VQSR training/truth datasets.
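
    The basic comparison described above can be sketched roughly as follows. This is an illustration only, not the evaluation code used at Broad: it assumes both the call set and the truth set have been reduced to (chrom, pos, ref, alt) tuples, and the names `Variant`, `Comparison` and `compare_to_truth` are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple

# Assumed site key: (chromosome, 1-based position, REF allele, ALT allele).
Variant = Tuple[str, int, str, str]


class Comparison(NamedTuple):
    true_positives: int   # called and present in the truth set
    false_positives: int  # called but absent from the truth set
    false_negatives: int  # in the truth set but not called
    sensitivity: float    # TP / (TP + FN)
    precision: float      # TP / (TP + FP)


def compare_to_truth(calls: Iterable[Variant], truth: Iterable[Variant]) -> Comparison:
    """Compare a call set to a truth set by exact site and allele match."""
    call_set = set(calls)
    truth_set = set(truth)
    tp = len(call_set & truth_set)
    fp = len(call_set - truth_set)
    fn = len(truth_set - call_set)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return Comparison(tp, fp, fn, sensitivity, precision)


# Made-up example data, for illustration only.
truth_sites = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
called_sites = [("1", 100, "A", "G"), ("2", 50, "G", "A"), ("2", 75, "T", "C")]
result = compare_to_truth(called_sites, truth_sites)
```

    Note that a call missing from the truth set is counted here as a false positive, so the numbers are only as trustworthy as the truth data, and they are only meaningful over the regions the truth set actually covers.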

    Geraldine Van der Auwera, PhD
