Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

long deletion over interval end leads to incorrect gvcf and crashes GenotypeGVCFs

Hi,
I have a problem with joint genotpying a large Exome cohort using GATK 3.5. I am trying to call genotypes on the exonic intervals only, but it appears that in a few locations a long deletion spanning outside of the interval results in a reference sequence mismatch, which trips up GenotypeGVCF, leading to an error message as seen below.

Here the relevant region from the interval file, as BED, which I use for the initial HaplotypeCaller in gvcf mode:
chr2 27380091 27380444 chr2 27380460 27380648

Here is the relevant snippet from the single vcf that has the long deletion spanning from the first interval almost up to the second:

chr2 27380323 . A <NON_REF> . . END=27380420 GT:DP:GQ:MIN_DP:PL 0/0:31:60:21:0,60,743 chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> 0 . DP=29;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=104400.00 GT:AD:DP:GQ:PL:SB 0/0:27,0,0:27:79:0,79,6544,81,6546,6548:7,20,0,0 chr2 27380461 . T <NON_REF> . . END=27380461 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,595
Since I am calling 18.000 exomes together, I combine them in chunks of 100. The result of CombineGVCF at this region looks like this:

chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> . chr2 27380422 . C *,<NON_REF> . chr2 27380423 . C *,<NON_REF> . chr2 27380424 . A *,<NON_REF> . chr2 27380425 . C *,<NON_REF> . chr2 27380426 . T *,<NON_REF> . chr2 27380427 . G *,<NON_REF> . chr2 27380428 . A *,<NON_REF> . chr2 27380429 . C *,<NON_REF> . chr2 27380430 . C *,<NON_REF> . chr2 27380431 . C *,<NON_REF> . chr2 27380432 . C *,<NON_REF> . chr2 27380433 . A C,*,<NON_REF> . chr2 27380434 . C *,<NON_REF> . chr2 27380435 . C *,<NON_REF> . chr2 27380436 . C *,<NON_REF> . chr2 27380437 . C *,<NON_REF> . chr2 27380438 . C *,<NON_REF> . chr2 27380439 . G C,*,<NON_REF> . chr2 27380440 . C *,<NON_REF> . chr2 27380441 . T C,*,<NON_REF> . chr2 27380442 . C *,<NON_REF> . chr2 27380443 . C *,<NON_REF> . chr2 27380444 . T *,<NON_REF> . chr2 27380460 . T *,<NON_REF> . chr2 27380461 . T <NON_REF> .'

Note the sequence content T at position 27380460. This does not match the actual reference C (second to last base in the call below)

java -jar GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -R GRCh38/GRCh38.fa -T FastaReferenceMaker -L chr2:27380421-27380461 2> /dev/null \>1 chr2:27380421 TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGGCCCCT

This then leads to downstream problems during GenotpyeVCFs which errors out with a message:

ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at chr2:27380460, C* vs. T*

This looks to me that the CombineGVCF step somehow mangles the sequence content of the actual deletion into the reference column of the multisample vcf. I have seen this happening at five different occasions in my dataset. Of course I could simply remove those five intervals from my set and be done with it, but I suppose this error will pop up more often with larger datasets, so I am looking for a better solution.

BTW, I noted that the actual deletion is reported with an allele depth of 0 in the initial gvcf ( 0/0:27,0,0). This seems weird to me but I assume that this is just one of the magic tricks that the de-novo assembler in the haplotype caller does. Or maybe that is already the root cause of all this mess.

Thanks for any feedback on this

Jens

Issue · Github
by Sheila

Issue Number
840
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
chandrans

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @JensR
    Hi Jens,

    Sounds like a potential bug. Can you please share your files so we can debug locally? Instructions are here.

    Thanks
    Sheila

  • JensRJensR SFMember

    Hi Sheila,

    Sorry for the delay, but it took me a while to reduce the number of inputs and restrict the intervals to something large enough to still reproduce the bug, but small enough to upload.

    The input files, reference genome, logs and commands ar bundled in a tar file, I just uploaded to your ftp: deletion_over_interval_ends_bug.tgz

    Let me know if you need any additional input to recreate the outcome on your end.

    Thanks
    Jens

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @JensR
    Hi Jens,

    Thanks. I will have a look soon and get back to you.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @JensR
    Hi Jens,

    I am investigating your bug right now, and I found something funny in one of the GVCF files. In the Banner-08-08.vcf, I see that there is a deletion at position 27380443 that spans 4 positions. However, the next site after that position in 27380461. So, positions 27380447-27380460 are missing in the GVCF. Note, position 27380460 is the position that throws an error in GenotypeGVCFs. Can you tell me how you generated the GVCFs? Can you try re-generating the GVCF for that sample (you can simply tun HaplotypeCaller on a small interval surrounding those sites) and let me know if you still see the same issue. If so, I will need you to submit the original BAM files.

    Thanks,
    Sheila

  • JensRJensR SFMember

    Hi Sheila,

    thanks for looking into it and sorry for not coming back more quickly, but it took me a while to figure things out.

    After reading your comments I went back to each command that I used during the analysis and noticed that at some time point between running HaplotypeCaller and GenotypeGVCF I switched from using a bed file for intervals to the GATK style interval lists (chr1:1-1000000). However, I was lazy (or stupid) and did not shift the intervals by 1 to translate between 0 and 1-based coordinates. This broke GenotypeGVCF as it was trying to call variants on positions that where not included in the intervals used for calling the variants using Haplotype caller. I reran my whole cohort from scratch using consostent intervals and this time around everything worked like a charm.

    Case closed.

    Jens

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Great, glad to hear you were able to locate the error. Thanks for letting us know.

Sign In or Register to comment.