
Exome sequencing - additional public data for variant calling

siro12345 (munich; Member)

Hi,
I'm new to exome sequencing, sorry if the questions have really obvious answers.

My data set contains only 3 samples: mother, father and daughter.
So far I'm running the standard workflow: IndelRealigner -> HaplotypeCaller -> VariantRecalibrator.

Question 1: HaplotypeCaller is recommended. I tried UnifiedGenotyper as well, which outputs about 30% more raw variants. Is that expected?

Question 2: This thread recommends using public data from 1000 Genomes if the sample size is smaller than 30. The data sets available from 1000GP don't use the Illumina Nextera technology for capture. Is that a problem? Should I look for public data that uses the exact same approach as ours?

Thanks for your help, I appreciate it! :-)
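
One way to look into the difference raised in Question 1 is to compare the two raw call sets site by site instead of only comparing totals. The following is a minimal sketch, assuming both callers' outputs are plain-text VCFs; the function names and the toy records are hypothetical and made up for illustration. It does not normalize indel representation, so the same indel written differently by the two callers would be counted as two separate sites.

```python
import io

def read_sites(handle):
    """Collect (chrom, pos, ref, alt) keys from VCF text, one key per ALT allele."""
    sites = set()
    for line in handle:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            sites.add((chrom, int(pos), ref, allele))
    return sites

def compare_callsets(hc_sites, ug_sites):
    """Count sites shared by both callers and sites unique to each."""
    return {
        "shared": len(hc_sites & ug_sites),
        "hc_only": len(hc_sites - ug_sites),
        "ug_only": len(ug_sites - hc_sites),
    }

# Toy records (hypothetical), standing in for the two raw VCF files.
hc_vcf = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n1\t200\t.\tCT\tC\n"
ug_vcf = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n1\t201\t.\tT\tG\n1\t300\t.\tG\tA\n"

summary = compare_callsets(read_sites(io.StringIO(hc_vcf)),
                           read_sites(io.StringIO(ug_vcf)))
print(summary)
```

With real files, the `io.StringIO` objects would be replaced by open file handles. Looking at where the caller-specific sites fall (for example, next to indels) is usually more informative than the raw percentage.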

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @siro12345
    Hi,

    1) I don't know about 30% more, but I suspect the extra variants will be removed in the filtering step. HaplotypeCaller is better at calling indels, which eliminates many of the false-positive SNP calls that UnifiedGenotyper makes.

    2) As long as you pre-process the BAM files from 1000 Genomes the same way you processed your own samples, you should be fine. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/4591/varian-quality-recalibration-annotations

    -Sheila

  • siro12345 (munich; Member)

    Hey Sheila,

    thank you very much for the reply!

    1) That's what I was sort of guessing as well, so I'll definitely stick with HC.

    2) Maybe I didn't mention all the necessary details. My data differs from the 1000 Genomes data in two ways. First, the exome enrichment method is different, so the targets don't overlap completely. Second, my data was run on a NextSeq, so the quality is lower.
    Taking both of these into account, wouldn't a data set of 30 samples from 1000 Genomes outweigh my 3 samples? Wouldn't a model built by VQSR then primarily fit the 1000 Genomes data and potentially be harmful when applied to my data?

    I could not find anything in this regard in the other post on the forum, sorry if I overlooked something.
    Thanks a lot for your help :-)

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @siro12345 That is definitely a concern. The question is how harmful it is compared to not being able to use VQSR at all and having to use hard filters instead. This is an honest open question; I don't know the answer. If you can find public data that uses the same approach as you did, that's probably better. Our recommendation to use 1000G was formulated at a time when 1000G was about the only place you could get a decent set of publicly available samples. Now, we should really say "1000G or any other public dataset that matches your data type as closely as possible".
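
For readers weighing the hard-filter fallback mentioned in the last reply, the following is a minimal sketch of what hard filtering does to a single SNP record. The thresholds are the generic SNP starting points commonly quoted from the GATK hard-filtering guidance; treat them as an assumption and check the current documentation before using them. The function names are hypothetical, and in practice this step would be done with GATK's own filtering tools, not hand-written code.

```python
# Assumed starting thresholds for SNPs; a record fails a rule when the test is True.
SNP_FAIL_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}

def parse_info(info_field):
    """Turn a VCF INFO string into a dict, ignoring flag-only entries such as DB."""
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
    return out

def failed_filters(info_field):
    """Return the names of the rules a record fails; missing annotations are skipped."""
    info = parse_info(info_field)
    failed = []
    for name, fails in SNP_FAIL_RULES.items():
        if name not in info:
            continue
        try:
            value = float(info[name])
        except ValueError:
            continue  # non-numeric value: leave the record untouched for this rule
        if fails(value):
            failed.append(name)
    return failed

print(failed_filters("QD=1.5;FS=3.2;MQ=60.0"))
print(failed_filters("QD=12.0;FS=75.1;MQ=35.0;DB"))
```

Unlike VQSR, fixed cutoffs like these do not adapt to the data, which is the trade-off the reply describes: with only three samples, the choice is between a model trained mostly on other people's data and thresholds that are not trained at all.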
