Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenotypeGVCFs - Sites missing

Hi,

I'm trying to call genotypes at a set of given positions from ~5,000 samples. To do so, I first generate gvcf files using HaplotypeCaller for the region around a gene for all samples separately with HaplotypeCaller. Then I'm merging the single gvcf files into batches of 1000 and then I'm generating the genotypes at ~180 sites using GenotypeGVCFs and the flag --includeNonVariantSites. Everything runs without errors, but in the end I noticed that ~20 sites were missing from the final VCF file. I then looked into these sites and the single files in detail and found out that there are simply no genotypes called for some samples at some positions.

I attached a small example. When I'm running the following command, no variant is in the output file:
java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L region.bed --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf --variant correct.gvcf \ -o out.vcf

The same command works when running it with only "correct.gvcf" as an input, but obviously not when running it with only "wrong.gvcf". I noticed that wrong.gvcf might show a variant at this exact position.

Best,
Thomas

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Thomas,

    Can you tell us exactly which position you're looking at, and what the desired result looks like in the final file?

  • thomas_wthomas_w MunichMember

    Hi Geraldine,

    here is what I do in detail in this example (in reality I'm doing this with 5,000 instead of 2 samples):

    I want to get the genotype at position chr12:64875771 for the two samples CORRECT and WRONG. When I'm running Haplotype Caller directly at the BAM files I get these results:
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64875771 -ERC BP_RESOLUTION \ -T HaplotypeCaller \ -I correct.bam \ -o hc.correct.vcf
    grep -v "#" hc.correct.vcf chr12 64875771 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:67,0:67:99:0,120,1800
    And:
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64875771 -ERC BP_RESOLUTION \ -T HaplotypeCaller \ -I wrong.bam \ -o hc.wrong.vcf
    grep -v "#" hc.wrong.vcf chr12 64875771 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:56,2:58:99:0,120,1800

    Since I want to do this for 5,000 samples in reality, what I actually want to do is to first get gvcf files from the BAM files and then use GenotypeGVCFs to get the genotpyes of all samples:
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64845000-64900000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \ -T HaplotypeCaller \ -I correct.bam \ -o correct.gvcf
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64875771 --includeNonVariantSites \ -T GenotypeGVCFs \ --variant correct.gvcf \ -o ggvcf.correct.vcf
    grep -v "#" ggvcf.correct.vcf chr12 64875771 . C . . . DP=22;NCC=1 GT:DP 0/0:22

    And:
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64845000-64900000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \ -T HaplotypeCaller \ -I wrong.bam \ -o wrong.gvcf
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64875771 --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf \ -o ggvcf.wrong.vcf
    grep -v "#" ggvcf.wrong.vcf

    So for the sample WRONG, the file ggvcf.wrong.vcf is empty. Also when I'm running GenotypeGVCFs on both files together, the resulting file is empty:
    java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L chr12:64875771 --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf \ --variant correct.gvcf \ -o ggvcf.both.vcf
    grep -v "#" ggvcf.both.vcf

    All files from this example are attached.

    Best,
    Thomas

  • thomas_wthomas_w MunichMember

    Hi Geraldine,

    apparently v.3.3 solves the problem! Thanks for your help!

    Best,
    Thomas

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Wonderful! Thanks for letting me know :)

Sign In or Register to comment.