Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

a strange varinat from HC

mayaabmayaab IsraelMember ✭✭

Hello GATK team,

  1. I've noticed a strange variant in the gvcf output of HC:

Raw vcf:
6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0

After GenortpeGVCFs:
6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0

The DP and AD are 0, but there is a variant - A.
What do you think? Why does it happened?

  1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two.
    3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38

Maya

Issue · Github
by Sheila

Issue Number
940
State
open
Last Updated

Best Answer

  • SheilaSheila Broad Institute admin
    Accepted Answer

    @mayaab
    Hi Maya,

    1) The issue of chromosome 6 can be fixed by adding -dontUseSoftClippedBases. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--dontUseSoftClippedBases Because there is such low coverage in the region, one soft clipped read is enough to skew the evidence in favor of an alternate allele. As for the AD and DP being 0, I just found out from Laura @gauthier that the log10 likelihood between each read's most likely allele and second most likely allele must be higher than 0.2. If the log10 likelihood is less than 0.2, the read is considered uninformative and is not reported in AD. I realize this may be confusing, so I am about to prepare a document on this. It was news to me too!

    2) For the chromosome 3 site, DP in the INFO field is the number of filtered reads before HaplotypeCaller reassembly. The FORMAT field DP is the number of filtered reads after Haplotype Caller's reassembly. Again, at this particular position, all the reads are uninformative (as described above).

    I hope this helps.

    I will try to have the document on uninformative reads posted by the end of the week.

    -Sheila

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @mayaab
    Hi Maya,

    This is indeed incorrect behavior. Can you please post IGV screenshots of the original bam file and bamout files at those positions?

    Thanks,
    Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @mayaab Interesting. Do you mind, if I ask, which version of GATK are you using? I haven't experienced this problem myself with the older version 3.2 on ~2000 samples. I'll follow this thread. Thanks for posting.

  • mayaabmayaab IsraelMember ✭✭

    Thanks for the response.
    Here are the IGV screenshots for those positions. The upper track is the original bam (multimapped and duplicates were filtered), the lower track is the bamout of HC.
    1. The position of the reads is very different after HC. As can be seen, there is coverage at that position in both files.

    1. The coverage in the original file at that position is 258, the coverage in the bamout is 356, and non of those numbers appear in the gvcf file. This gvcf file was created on one sample alone.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @mayaab Please confirm which version this was done with.

  • mayaabmayaab IsraelMember ✭✭

    version 3.3-0

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Alright, we're going to need a snippet to troubleshoot locally.Can you pleaseshare some test data with us? Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • mayaabmayaab IsraelMember ✭✭

    Yes, no prob. We are having some network problems, so I don't have access to the data right now. I will upload the files as soon as this issue will be solved.

  • mayaabmayaab IsraelMember ✭✭

    I've uploaded the files. The archive file's name is: to_GATK_team.ta.gz
    in the directory you will find two directories:
    1. cov_0_variant_R578 - all the files related to the variant that appear at a possition that reports no coverage.
    2. different_cov_R538 - all the files related to the differant coverage of the two DPs

  • SheilaSheila Broad InstituteMember, Broadie admin
    Accepted Answer

    @mayaab
    Hi Maya,

    1) The issue of chromosome 6 can be fixed by adding -dontUseSoftClippedBases. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--dontUseSoftClippedBases Because there is such low coverage in the region, one soft clipped read is enough to skew the evidence in favor of an alternate allele. As for the AD and DP being 0, I just found out from Laura @gauthier that the log10 likelihood between each read's most likely allele and second most likely allele must be higher than 0.2. If the log10 likelihood is less than 0.2, the read is considered uninformative and is not reported in AD. I realize this may be confusing, so I am about to prepare a document on this. It was news to me too!

    2) For the chromosome 3 site, DP in the INFO field is the number of filtered reads before HaplotypeCaller reassembly. The FORMAT field DP is the number of filtered reads after Haplotype Caller's reassembly. Again, at this particular position, all the reads are uninformative (as described above).

    I hope this helps.

    I will try to have the document on uninformative reads posted by the end of the week.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie admin

    @mayaab
    Hi again Maya,

    What I said above about the different DPs is true, but I am digging into your chromosome 3 issue more. There is some weird behavior going on when using different advanced arguments. We do know of issues in highly repetitive regions, and hopefully we can fix some of them with this bug.

    -Sheila

  • HasaniHasani GermanyMember

    Dear GATK Team,

    how comes that GATK gives for the same position different IDs or even no ID at all?

    P.S. Each of those mutations come from different sample but the same dbsnp file was used when calling and the same arguments.

    e.g.:

    chr21 29729190 rs71189304 chr21 29729190 rs71335030 chr21 29729190 rs376261752 chr21 29729190 rs10684666 chr21 29729190 rs71335030 chr21 29729190 rs375506947 chr21 29729190 .

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Hasani Each rsID is specific to a particular variant allele, not just the general site position.

Sign In or Register to comment.