Haplotype caller reports a variant when using the gVCF format but not when using the VCF format

Dear GATK team,
I have a variant that is called when setting the --emitRefConfidence to GVCF but it is not called when the output is switched to the VCF format.

gVCF output:
chr11 108250864 . G C, 317.78 . BaseQRankSum=0.294;ClippingRankSum=0.000;DP=12;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=43200.00;ReadPosRankSum=-0.582 GT:AD:DP:GQ:PL:SB 0/1:1,11,0:12:0:343,0,0,346,33,378:0,1,5,6

VCF output:
chr11 108250864 . G C 317.78 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.294;ClippingRankSum=0.000;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=26.48;ReadPosRankSum=-0.582;SOR=0.293 GT ./.

Please, find attached an IGV screenshot of this region. The variant seems obvious.

Any help would be much appreciated.

Thanks very much in advance.

Regards,

Sheila

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sheilazt
    Hi Sheila,

    Can you post the bamout file?

    Also, because the GQ is 0, when you run GenotypeGVCFs, the final VCF will output ./. as well.

    -Sheila

    P.S. I am not sure why the call is being made with GQ0. I may need you to submit a bug report after I see the bamout file.

  • SteveLSteveL BarcelonaMember
    edited January 2017

    Dear @Sheila,

    I have observed something similar to @sheilazt. i.e. a position with excellent read-depth, but with GQ of 0. In my case it is clearly preceded by an insertion - which may even be an error in the GRCh37 genome - certainly the "insertion" seems to be ubiquitous.

    zgrep -A1 -B1 761958 ../gVCF/E075228/E075228.1.g.vcf.gz

    1 761957 . A AT, 1207.73 . DP=30;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=43308.00 GT:AD:DP:GQ:PL:SB 1/1:0,30,0:30:93:1245,93,0,1245,93,1245:0,0,30,0
    1 761958 . C . . END=761958 GT:DP:GQ:MIN_DP:PL 0/0:35:0:35:0,0,0
    1 761959 . T . . END=762158 GT:DP:GQ:MIN_DP:PL 0/0:125:99:36:0,100,1349

    I am curious as to why haplotypecaller thinks the 1:761958 position is unreliable as a homozygous ref call?

    I have uploaded E075228.1-761958.bamout.bam to your ftp, and attached an IGV screen-shot of same.

    Haplotype Caller version 3.6-0-g89b7209 was used.

    I have 15 samples in a large pedigree, and with the exception of one dubious-looking heterozygote call, they all got ./. when genotypeGVCF was run, as shown below - the bamout and IGV screenshot are for the first individual, which has the greatest coverage at this position.

    1 761958 . C T 47.03 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.840e+00;ClippingRankSum=0.00;DP=281;ExcessHet=3.0103;FS=0.000;LikelihoodRankSum=-1.950e+00;MLEAC=1;MLEAF=0.500;MQ=12.65;MQ0=0;MQRankSum=2.21;QD=1.68;ReadPosRankSum=0.601;SOR=0.061 GT:AD:DP:GQ:PGT:PID:PL ./.:35,0:35:.:.:.:0,0,0 ./.:21,0:21:.:.:.:0,0,0 ./.:16,0:16:.:.:.:0,0,0 ./.:12,0:12:.:.:.:0,0,0 ./.:15,0:15:.:.:.:0,0,0 ./.:22,0:22:.:.:.:0,0,0 0/1:23,5:28:71:1|0:761957_A_AT:71,0,841 ./.:12,0:12:.:.:.:0,0,0 ./.:24,0:24:.:.:.:0,0,0 ./.:17,0:17:.:.:.:0,0,0 ./.:14,0:14:.:.:.:0,0,0 ./.:20,0:20:.:.:.:0,0,0 ./.:8,0:8:.:.:.:0,0,0 ./.:12,0:12:.:.:.:0,0,0 ./.:25,0:25:.:.:.:0,0,0

    Thanks for your help, as always.

    Steve

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @SteveL
    Hi Steve,

    You may find this thread interesting/relevant. Are you showing the soft clipped bases in your bamout screenshot? Can you also post a larger region surrounding the area (~1000 bases before and after should be good)?

    Thanks,
    Sheila

  • sheilaztsheilazt ValenciaMember

    Dear Sheila,
    Sorry for my late reply. I hadn't realized you had replied to my message.

    I've just uploaded the BAM file obtained with the --bamOutput option into your FTP server with these credentials.

    location: ftp.broadinstitute.org
    username: gsapubftp
    password: 5WvQWSfi

    The file name is P288_bamout.bam

    Thanks very much in advance.

    Sheila

  • SteveLSteveL BarcelonaMember

    Hi @sheila,
    Thanks for your response, the bam I uploaded ot your FTP (E075228.1-761958.bamout.bam) had the +/-1000 you requested. And here is the equivalent screenshot. I neglected to include it in my original post, but I see the GQ for the band before the insert is 0 as well.

    I haven't had time to investigate extensively, but I have noticed other similar positions in the gVCF - i.e. an A->AX insertion with good depth and GQ, followed by a single reference position with GQ of 0, despite excellent depth. Might be an artefact from BQSR?

    I spotted these positions initially because once GenoetypeGVCFs is run, they get converted to ./. despite the high depth, as reported by @sheilazt

    If you have some other ideas you would like me to explore with the same sample, let me know and I will investigate further.

    Original position (I see the editor is removing anything between greater-than and less-than signs, so I stripped them here first, to be clear)

    1 761910 . T NON_REF . . END=761956 GT:DP:GQ:MIN_DP:PL 0/0:19:0:13:0,0,0
    1 761957 . A AT,NON_REF 1207.73 . DP=30;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=43308.00 GT:AD:DP:GQ:PL:SB 1/1:0,30,0:30:93:1245,93,0,1245,93,1245:0,0,30,0
    1 761958 . C NON_REF . . END=761958 GT:DP:GQ:MIN_DP:PL 0/0:35:0:35:0,0,0
    1 761959 . T NON_REF . . END=762158 GT:DP:GQ:MIN_DP:PL 0/0:125:99:36:0,100,1349
    1 762159 . T C,<NON_REF 4128.77 . BaseQRankSum=-0.608;ClippingRankSum=0.000;DP=325;ExcessHet=3.0103;LikelihoodRankSum=-0.287;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=1.176;RAW_MQ=946322.00;ReadPosRankSum=-0.324 GT:AD:DP:GQ:PL:SB 0/1:167,156,0:323:99:4157,0,4473,4658,4942,9599:79,88,80,76

    Similar suspect positions from same gVCF:

    1 178514554 . A NON_REF . . END=178514554 GT:DP:GQ:MIN_DP:PL 0/0:213:21:213:0,21,315
    1 178514555 . G NON_RE . . END=178514559 GT:DP:GQ:MIN_DP:PL 0/0:211:6:208:0,6,90
    1 178514560 . A AT,NON_REF 6361.73 . DP=204;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=735700.00 GT:AD:DP:GQ:PL:SB 1/1:0,195,0:195:99:6399,592,0,6399,593,6400:0,0,76,119
    1 178514561 . T NON_REF . . END=178514561 GT:DP:GQ:MIN_DP:PL 0/0:211:0:211:0,0,0
    1 178514562 . T NON_REF . . END=178515128 GT:DP:GQ:MIN_DP:PL 0/0:216:99:41:0,105,1575

    zgrep -A2 -B2 228461408 E075228.1.g.vcf.gz
    1 228461399 . T NON_REF . . END=228461399 GT:DP:GQ:MIN_DP:PL 0/0:65:24:65:0,24,360
    1 228461400 . G NON_REF . . END=228461407 GT:DP:GQ:MIN_DP:PL 0/0:70:0:68:0,0,0
    1 228461408 . A AG,NON_REF 2209.73 . DP=69;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQ0=0;RAW_MQ=249700.00 GT:AD:DP:GQ:PL:SB 1/1:0,66,0:66:99:2247,201,0,2247,202,2247:0,0,46,20
    1 228461409 . G NON_REF . . END=228461409 GT:DP:GQ:MIN_DP:PL 0/0:69:0:69:0,0,0
    1 228461410 . G NON_REF . . END=228462532 GT:DP:GQ:MIN_DP:PL 0/0:245:99:40:0,102,1423

    Issue · Github
    by Sheila

    Issue Number
    1672
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SteveLSteveL BarcelonaMember

    P.S. I am not showing soft-clipped bases is the screenshot - as this is bam-out - I presume HC when making the call, looks at the variant +/- certain region when trying to make the haplotypes. Including soft-clipped did not appear to make any visible difference.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sheilazt
    Hi Sheila.

    Wow, that region is super messy. Can you post a larger region? You can just post some IGV screenshots to this thread (no need to upload to the server). Please include about 1000 bases before and after the site of interest. Please also post the original BAM file and bamout file.

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sheilazt
    Hi again Sheila,

    Ooops! I just realized thanks to Soo Hee, that I was viewing the reads against the wrong reference! That is why it looked so messy. Can you submit a bug report following the instructions here.

    Thanks,
    Sheila

  • sheilaztsheilazt ValenciaMember

    Dear Sheila,
    I've just uploaded the information you requested in your server. The name of the file is sheilazt.zip.

    Thanks very much in advance!

    Regards,

    Sheila

    Issue · Github
    by Sheila

    Issue Number
    1679
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hey everyone, this looks like an issue reported a while ago here.

    Do you agree it sounds like the same symptoms? If so this is an edge case we know about; we have a ticket to fix it in GATK4, although no one is available to work on it right now. So for now it's considered a known limitation.

  • sheilaztsheilazt ValenciaMember

    Dear all,
    Although the variant I reported was not an indel the GQ is 0 in the GVCF output so it might be the same issue as the one you're pointing out.

    I'll bare this limitation in mind then.

    Thanks very much!

    Regards,

    Sheila

  • SteveLSteveL BarcelonaMember

    Hi @Geraldine_VdAuwera,

    It would appear to be the same issue. I would think this is a fairly serious limitation - particularly if trying to identify de novos in a trio for example. Do we know if there is a work-around whereby we can identify such sites with a reasonable degree of confidence and reassign them a 0/0 GT instead?

    I am also surprised that this probably won't be resolved in the initial 4.0 release, given that it has been around for 15 months. Is that still the case?

    Thanks for your help as always.

    Steve

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @SteveL I would suggest you contact Chuck (@chlangley) in the original thread I linked to above; he did a lot of digging at the time and last we heard was looking into producing ad-hoc filters to identify and deal with these sites. He might have a workaround he would be willing to share.

    Regarding GATK4, please understand that it has been a huge undertaking to rewrite the entire underlying framework, plus rewrite key parts of tools like HaplotypeCaller that were rather monstrous in their original implementation. Much of the focus of the rewrite has been to streamline the code and operation, make it much leaner, faster and more scalable, without adversely affecting the quality of the calls. For the past few months we have been putting the ported tools through their paces, painstakingly evaluating results and fixing bugs and regressions. On the way we have identified more areas that can be improved, including the assembly engine which is the culprit here, and we are tackling those, not as individual bug fixes, but as deep rewrites of the affected code. However we want to at least output a beta version that produces equivalent results before releasing the fully improved functionality. The latter might be what we end up calling "4.0" -- but the true initial release will be the upcoming beta, which is slated for the end of this quarter (March).

  • SteveLSteveL BarcelonaMember

    Thanks @Geraldine_VdAuwera . I can appreciate it must be a huge undertaking to rewrite everything, and of course you have to work through the highest priority issues first, and I am sure you have a long list (I am familiar with such myself). Best of luck to you and the team over the next few months!

Sign In or Register to comment.