If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
long deletion over interval end leads to incorrect gvcf and crashes GenotypeGVCFs
I have a problem with joint genotpying a large Exome cohort using GATK 3.5. I am trying to call genotypes on the exonic intervals only, but it appears that in a few locations a long deletion spanning outside of the interval results in a reference sequence mismatch, which trips up GenotypeGVCF, leading to an error message as seen below.
Here the relevant region from the interval file, as BED, which I use for the initial HaplotypeCaller in gvcf mode:
chr2 27380091 27380444 chr2 27380460 27380648
Here is the relevant snippet from the single vcf that has the long deletion spanning from the first interval almost up to the second:
chr2 27380323 . A <NON_REF> . . END=27380420 GT:DP:GQ:MIN_DP:PL 0/0:31:60:21:0,60,743 chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> 0 . DP=29;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=104400.00 GT:AD:DP:GQ:PL:SB 0/0:27,0,0:27:79:0,79,6544,81,6546,6548:7,20,0,0 chr2 27380461 . T <NON_REF> . . END=27380461 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,595
Since I am calling 18.000 exomes together, I combine them in chunks of 100. The result of CombineGVCF at this region looks like this:
chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> . chr2 27380422 . C *,<NON_REF> . chr2 27380423 . C *,<NON_REF> . chr2 27380424 . A *,<NON_REF> . chr2 27380425 . C *,<NON_REF> . chr2 27380426 . T *,<NON_REF> . chr2 27380427 . G *,<NON_REF> . chr2 27380428 . A *,<NON_REF> . chr2 27380429 . C *,<NON_REF> . chr2 27380430 . C *,<NON_REF> . chr2 27380431 . C *,<NON_REF> . chr2 27380432 . C *,<NON_REF> . chr2 27380433 . A C,*,<NON_REF> . chr2 27380434 . C *,<NON_REF> . chr2 27380435 . C *,<NON_REF> . chr2 27380436 . C *,<NON_REF> . chr2 27380437 . C *,<NON_REF> . chr2 27380438 . C *,<NON_REF> . chr2 27380439 . G C,*,<NON_REF> . chr2 27380440 . C *,<NON_REF> . chr2 27380441 . T C,*,<NON_REF> . chr2 27380442 . C *,<NON_REF> . chr2 27380443 . C *,<NON_REF> . chr2 27380444 . T *,<NON_REF> . chr2 27380460 . T *,<NON_REF> . chr2 27380461 . T <NON_REF> .'
Note the sequence content T at position 27380460. This does not match the actual reference C (second to last base in the call below)
java -jar GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -R GRCh38/GRCh38.fa -T FastaReferenceMaker -L chr2:27380421-27380461 2> /dev/null \>1 chr2:27380421 TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGGCCCCT
This then leads to downstream problems during GenotpyeVCFs which errors out with a message:
ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at chr2:27380460, C* vs. T*
This looks to me that the CombineGVCF step somehow mangles the sequence content of the actual deletion into the reference column of the multisample vcf. I have seen this happening at five different occasions in my dataset. Of course I could simply remove those five intervals from my set and be done with it, but I suppose this error will pop up more often with larger datasets, so I am looking for a better solution.
BTW, I noted that the actual deletion is reported with an allele depth of 0 in the initial gvcf ( 0/0:27,0,0). This seems weird to me but I assume that this is just one of the magic tricks that the de-novo assembler in the haplotype caller does. Or maybe that is already the root cause of all this mess.
Thanks for any feedback on this