Exome sequencing - additional public data for variant calling

I'm new to exome sequencing, sorry if the questions have really obvious answers.

My data set contains only 3 different samples from mother, father and daughter.
So far I'm following the standard workflow: IndelRealigner -> HaplotypeCaller -> VariantRecalibrator.

Question 1: HaplotypeCaller is recommended. I tried UnifiedGenotyper as well, which outputs about 30% more raw variants. Is that expected?

Question 2: This thread recommends using public data from 1000 Genomes if the sample size is smaller than 30. The available 1000GP data sets don't use the Illumina Nextera technology for capture. Is that a problem? Should I look for public data that uses the exact same approach as ours?

Thanks for your help, I appreciate it ! :-)


  • Sheila, Broad Institute (Member, Broadie)


    1) I don't know about 30% more, but I suspect the extra variants will be removed in the filtering step. HaplotypeCaller is better at calling indels, which eliminates a lot of the false positive SNP calls that UnifiedGenotyper makes.

    2) As long as you pre-process the BAM files from 1000 Genomes the same way you did your own samples, you should be fine. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/4591/varian-quality-recalibration-annotations


  • siro12345, Munich (Member)

    Hey Sheila,

    thank you very much for the reply!

    1) That's what I was sort of guessing as well, so I'll definitely stick with HC.

    2) Maybe I didn't mention all the necessary details. My data differs from the 1000 Genomes data in two ways. First, the exome enrichment method is different, so the targets don't overlap completely. Second, my data was run on a NextSeq, so the quality is lower.
    Taking both of these into account, wouldn't a data set of 30 samples from 1000 Genomes outweigh my 3 samples? A model built by VQSR would then primarily fit the 1000 Genomes data and could be harmful when applied to my data.

    I could not find anything in this regard in the other post on the forum, sorry if I overlooked something.
    Thanks a lot for your help :-)

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @siro12345 That is definitely a concern. The question is how harmful it is compared with not being able to use VQSR at all and having to use hard filters instead. This is an honest open question; I don't know the answer. If you can find public data that uses the same approach as you did, that's probably better. Our recommendation to use 1000G was formulated at a time when 1000G was about the only place you could get a decent set of publicly available samples. Now, we should really say "1000G or any other public dataset that matches your data type as closely as possible".
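
One rough way to judge how closely a public data set "matches" in terms of capture targets is to measure what fraction of your own target bases is also covered by the public kit's targets. The sketch below is only an illustration, not a GATK tool or a recommendation from the thread: the function names are made up for this example, and it assumes both target lists are already loaded as BED-style half-open `(chrom, start, end)` tuples with consistent chromosome naming. It says nothing about sequencer quality differences, which were the second concern raised above.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                # Overlaps or touches the previous span: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(span) for span in out]
    return merged


def total_bases(merged):
    """Total number of bases covered by a merged interval set."""
    return sum(end - start for spans in merged.values() for start, end in spans)


def shared_bases(a, b):
    """Number of bases covered by both merged interval sets (two-pointer sweep)."""
    shared = 0
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        while i < len(xs) and j < len(ys):
            lo = max(xs[i][0], ys[j][0])
            hi = min(xs[i][1], ys[j][1])
            if lo < hi:
                shared += hi - lo
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return shared


def target_overlap_fraction(own_targets, public_targets):
    """Fraction of your own target bases that the public kit's targets also cover."""
    own = merge_intervals(own_targets)
    public = merge_intervals(public_targets)
    own_total = total_bases(own)
    return shared_bases(own, public) / own_total if own_total else 0.0


if __name__ == "__main__":
    # Toy coordinates, purely illustrative.
    own = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
    public = [("chr1", 250, 400), ("chr2", 10, 20)]
    print(target_overlap_fraction(own, public))
```

A low fraction would suggest that many of the public samples' calls fall outside your targets (and vice versa), so restricting the analysis to the shared intervals, or looking for a public data set made with a closer capture kit, may be worth considering.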
