Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Missing positions in the GVCF file

Hello,

It seems the banded gvcf produced by HaplotypeCaller is missing many positions in the genome. We are not able to find half the specific positions we want to look at. To be clear, these positions are not even contained in any band, they are simply not there. For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf:

1 866433 . G C, 412.77 . BaseQRankSum=-1.085;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.519;ReadPosRankSum=0.651 GT:AD:GQ:PL:SB 0/1:15,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866434 . G . . END=866435 GT:DP:GQ:MIN_DP:PL 0/0:22:45:21:0,45,675
1 866436 . ACGTG A, 403.73 . BaseQRankSum=-0.510;DP=17;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.714;ReadPosRankSum=0.434 GT:AD:GQ:PL:SB 0/1:16,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866441 . A . . END=866441 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,402

We looked at the corresponding bam files and there seems to be good coverage for these positions. Moreover, if HaplotypeCaller is run in BP_RESOLUTION mode, then these positions are present in the resulting vcf. Any idea what might be going wrong?

Thanks,

Juber

Issue · Github
by Geraldine_VdAuwera

Issue Number
851
State
closed
Last Updated
Assignee
Array
Closed By
ldgauthier

Answers

  • juberjuber Member

    Sorry,

    the gvcf snippet got formatted really bad. Here it is again:

    1 866433 . G C, 412.77 . BaseQRankSum=-1.085;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.519;ReadPosRankSum=0.651 GT:AD:GQ:PL:SB 0/1:15,1,0:99:441,0,23572,546,23612,24157:0,0,0,0

    1 866434 . G . . END=866435 GT:DP:GQ:MIN_DP:PL 0/0:22:45:21:0,45,675

    1 866436 . ACGTG A, 403.73 . BaseQRankSum=-0.510;DP=17;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.714;ReadPosRankSum=0.434 GT:AD:GQ:PL:SB 0/1:16,1,0:99:441,0,23572,546,23612,24157:0,0,0,0

    1 866441 . A . . END=866441 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,402

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Is this directly the output of the HC or has this gone through CombineGVCFs? I think I remember we had an issue like this at the combination step, which was fixed internally...

  • juberjuber Member

    Hi Geraldine,

    This is the direct output of HC. I confirmed that.

    Thanks,

    Juber

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Were you using an intervals file with -L?

  • juberjuber Member

    No, we did not use intervals while creating the banded gvcf.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, then it's a problem. Can you please try re-calling that region with the latest nightly build and let me know if the problem persists?

  • bgrenierbgrenier FranceMember

    Hello,

    I have the same problem with missing positions using HaplotypeCaller in mode GVCF compared to mode BP_resolution. I have run HaplotypeCaller using GATK v3.2-2 and the nightly build nightly-2014-08-04-gf3f6f30 and the problem still persists.

    Missing positions only occur after some indels and it appears there's no rule compared to the individual genotype. The only thing I noticed is that it always skip positions according to the indel length.

    Here are 2 examples of missing positions (GVCF mode):
    1 1588744 rs79724854 AGCG A,<NON_REF> 312.73 . BaseQRankSum=-1.640;ClippingRankSum=0.166;DB;DP=62;MLEAC=1,0;MLEAF=0.500,0.00;MQ=49.38;MQ0=0;MQRankSum=-0.516;ReadPosRankSum=-0.666 GT:AD:DP:GQ:PL:SB 0/1:51,11,0:62:99:350,0,2106,504,2142,2646:31,20,3,8 1 1588748 . T <NON_REF> . . END=1588748 GT:DP:GQ:MIN_DP:PL 0/0:61:0:61:0,0,1387

    1 1654065 rs35174499 AGCG A,<NON_REF> 452.74 . DB;DP=11;MLEAC=2,0;MLEAF=1.00,0.00;MQ=51.28;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,11,0:11:33:495,33,0,495,33,495:0,0,11,0 1 1654069 . T <NON_REF> . . END=1654110 GT:DP:GQ:MIN_DP:PL 0/0:15:45:11:0,0,0

    And 2 examples of indels where positions are not missing (GVCF mode):
    1 1647893 rs144636354 C CTTTCTT,<NON_REF> 612.73 . BaseQRankSum=0.774;ClippingRankSum=0.139;DB;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.615;ReadPosRankSum=-1.489 GT:AD:DP:GQ:PL:SB 0/1:17,14,0:31:99:650,0,643,701,694,1395:3,14,4,10 1 1647894 . T <NON_REF> . . END=1647894 GT:DP:GQ:MIN_DP:PL 0/0:37:0:37:0,0,561

    1 884091 . C CACCCTGGTCCCCCTGGTCCCTTTGGCCCTGCACCTGGCTGG,<NON_REF> 2158.73 . DP=62;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,52,0:52:99:2197,164,0,2203 ,170,2209:0,0,12,40 1 884092 . A <NON_REF> . . END=884092 GT:DP:GQ:MIN_DP:PL 0/0:68:99:68:0,105,157

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    There is a big difference in the example your provided.
    The first cases where positions are "missing" (I put that in quotes because they are not in fact missing) are deletions. For these, the bases that span the deletion are all one block. For example, 1:1588744-1588747 are all covered by the first record in your example (and you can tell this because there are 4 bases given in the REF column represented by this record). So the next record should logically be 1:1588748 (which in fact it is).
    The second set of cases are insertions. For them there is only a single reference base, so logically the next record must represent the base immediately following the insertion.
    So you see that the output is very much consistent throughout.

  • bgrenierbgrenier FranceMember

    Thanks for the quick and informative answer.

    Anyway, this difference between BP resolution and GVCF modes leads to a discrepancy at the GenotypeGVCFs step.

    Here is one example

    1)GVCF mode

    1.1) HaplotypeCaller
    1 233489705 rs367946480 TTG T,TTGTG,<NON_REF> 1046.73 . BaseQRankSum=-0.762;ClippingRankSum=0.294;DB;DP=159;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.654;ReadPosRankSum=-0.174 GT:AD:DP:GQ:PL:SB 0/2:69,6,40,0:115:99:1084,1225,3883,0,2085,2155,1292,3488,2252,3544:22,47,12,28 1 233489708 . T <NON_REF> . . END=233489797 GT:DP:GQ:MIN_DP:PL 0/0:95:99:41:0,99,1485

    1.2)GenotypeGVCFs
    1 233489705 . T TTG 1049.58 PASS ... GT:AD:DP:GQ:PL 0/1:69,40:115:99:1084,0,2155 1 233489707 . G T 5938.30 PASS ... GT:AD:DP:GQ:PL ./.:0,0

    2) BP resolution

    2.1) HaplotypeCaller
    1 233489705 rs367946480 TTG T,TTGTG,<NON_REF> 1046.73 . BaseQRankSum=-0.762;ClippingRankSum=0.294;DB;DP=159;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.654;ReadPosRankSum=-0.174 GT:AD:DP:GQ:PL:SB 0/2:69,6,40,0:115:99:1084,1225,3883,0,2085,2155,1292,3488,2252,3544:22,47,12,28 1 233489706 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:108,42:150:0:0,0,2112 1 233489707 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:146,5:151:99:0,120,1800

    2.2)GenotypeGVCFs
    1 233489705 . T TTG 1049.58 PASS ... GT:AD:DP:GQ:PL 0/1:69,40:115:99:1084,0,2155 1 233489707 . G T 5935.72 PASS ... GT:AD:DP:GQ:PL 0/0:146,5:151:99:0,120,1800

    So, after genotyping there's a discrepancy between the two modes for the SNP located at position 233489707.

  • rpoplinrpoplin Member ✭✭✭

    Hi there,

    Thanks for the information. Can you please provide the command lines that you used to generate those lines. Something doesn't quite seem right.

    Thanks for your help,

  • bgrenierbgrenier FranceMember
    edited August 2014

    Hi,

    1) To be precise, HaplotypeCaller was followed by CombineGVCFs before executing GenotypeGVCFs. Here are the command lines I used :

    1.1 HaplotypeCaller (GVCF and BP_RESOLUTION as $mode)
    java -Xmx${memory}g -jar ${gatk} \ -T HaplotypeCaller \ -R ${reference} \ -I input.bam \ -D ${dbsnp} \ -L ${interval} \ -ip 50 \ -gt_mode DISCOVERY \ -stand_call_conf 30\ -stand_emit_conf 30 \ -ERC ${mode} \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ -o output.g.vcf \ --disable_auto_index_creation_and_locking_when_reading_rods

    1.2 CombineGVCFs
    java -Xmx${memory}g -jar ${gatk} \ -T CombineGVCFs \ -R ${reference} \ -V input1.vcf.gz \ ... -V inputN.vcf.gz \ -o merged.g.vcf

    1.3 GenotypeGVCFs
    java -Xmx${memory}g -jar ${gatk} \ -T GenotypeGVCFs \ -R ${reference} \ -V merged.g.vcf.gz \ -nt 16 \ -o genotyped.vcf

    2) After running some tests, I discovered that without the combineGVCFs step, the genotype is called :

    2.1 Genotype using HaplotypeCaller (GVCF mode) + CombineGVCFs + GenotypeGVFs :
    1 233489705 . T TTG 1049.58 . ... GT:AD:DP:GQ:PL 0/1:69,40:115:99:1084,0,2155 1 233489707 . G T 5938.30 . ... GT:AD:DP:GQ:PL ./.:0,0

    2.2 Genotype using HaplotypeCaller (BP_RESOLUTION mode) + CombineGVCFs + GenotypeGVFs :
    1 233489705 . T TTG 1049.58 . ... GT:AD:DP:GQ:PL 0/1:69,40:115:99:1084,0,2155 1 233489707 . G T 5935.72 . ... GT:AD:DP:GQ:PL 0/0:146,5:151:99:0,120,1800

    2.3 Genotype using HaplotypeCaller (GVCF mode) + GenotypeGVCFs only :
    1 233489705 . T TTG 1049.58 . ... GT:AD:DP:GQ:PL 0/1:69,40:115:99:1084,0,2155 1 233489707 . G T 5935.72 . ... GT:AD:DP:GQ:PL 0/0:69,0:115:99:0,208,2460

    Following is the content of the CombineGVCFs file for that particular individual in GVCF mode :
    1 233489705 rs367946480 TTG T,TTGTG,<NON_REF> . . ... GT:AD:DP:MIN_DP:PL:SB ./.:69,6,40,0:115:.:1084,1225,3883,0,2085,2155,1292,3488,2252,3544:22,47,12,28 1 233489706 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./. 1 233489707 rs7551511 G T,<NON_REF> . . ... GT:AD:DP:MIN_DP:PL:SB ./.

    Following is the content of the CombineGVCFs file or that particular individual in BP_RESOLUTION mode :
    1 233489705 rs367946480 TTG T,TTGTG,<NON_REF> . . ... ./.:69,6,40,0:115:1084,1225,3883,0,2085,2155,1292,3488,2252,3544:22,47,12,28 1 233489706 . T <NON_REF> . . . GT:AD:DP:GQ:PL ./.:108,42:150:0:0,0,2112 1 233489707 rs7551511 G T,<NON_REF> . . ... GT:AD:DP:PL:SB ./.:146,5,5:151:0,120,1800,120,1800,1800

    Thanks for your help

  • bgrenierbgrenier FranceMember

    Hi,

    I just wanted to know if there's an update concerning this CombineGVCFs problem.

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    Can you send us a snippet of the files that are causing the error? We would like to replicate it locally.

    Instructions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • bgrenierbgrenier FranceMember

    Hi,

    Since the problem seems to be due to CombineGVCFs (because the genotype is called when running directly GenotypeGVCFs but not if a CombineGVCFs step is ran before), should I only include the gvcf files of some individuals to repdoduce the error or do you really need bam files ?

    Thanks,

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    The gvcf files are fine. There is no need to include the bam files.

    Thanks,
    Sheila

  • bgrenierbgrenier FranceMember

    @Sheila

    Hi,

    The archive containing README, commands, and files has been uploaded under the name "bgrenier_combineGVCFs.tar.gz"
    If something doesn't seem to be clear, feel free to ask.

    Thanks,

  • bgrenierbgrenier FranceMember
    edited September 2014

    Hi, just wondering if there was any updates concerning that problem.

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    I am still working on this bug report. We have been busy with a workshop the past few weeks. I will let you know when the report is submitted and when it is fixed.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    Thanks for the reminder yesterday. I just submitted the bug report. I will let you know when it is fixed.

    -Sheila

  • udpudp virginiaMember

    Hi,

    I have somewhat similar but a different issue. I am running GATK3.2.2 version. I use the following command for running haplotypecaller in gvcf mode:

    java -Xmx2g -jar $JAR \
    -T HaplotypeCaller \
    -L region.bed \
    -R $REF \
    -I $sample.bwamem.sort.dedup.realigned.bam \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -o $sample.raw.snps.indels.g.vcf \
    --bamOutput $sample.haplo.bam

    For a variant (chr1:89850791-89850792) I see the evidence in both the above vcf output as well as the bam output. But when I do a joint genotyping for all samples, I miss the variant in the resulting VCF (all.haplotype.vcf). These are the parameters (basically same) I use for joint genotyping:

    java -Xmx2g -jar $JAR \
    -T GenotypeGVCFs \
    -V files.list \
    -R $REF \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -o all.haplotype.vcf

    This looks like a bug to me. Can you please suggest?

    Regards,
    Uma

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @udp‌

    Hi Uma,

    Have you checked the quality score of the call in the gvcf? GVCF mode outputs out possible calls regardless of the quality scores. In the resulting vcf, low quality score calls do get filtered out.

    -Sheila

  • bgrenierbgrenier FranceMember

    @Sheila‌
    Sorry to bother you again but do you have any update on this topic ?

    Thanks,

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    I have a similar problem. I am expecting to see the variant C/T for sample NA12878 at coordinate 20:544040:

    wget ftp://ftp.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
    zgrep -w 544040 NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
    20  544040  .   C   T   4183.5  PASS    DP=869;DPR=0.998;PGC=5,2,2,2;PGDR=1.034,1.000,0.976,1.038;PHQ=44;source=NISTPASS,RTGPHQ;PHC GT:DP:DPR:RE:AR:RQ:GQ:ABP:SBP:RPB:PPB:PUR:RS:AD:GL:AVR:PS   0|1:55:1.092:0.082:0.000:628.7:629:1.93:1.45:0.36:0.28:0.02:C,31,0.046,T,24,0.036:31,24:-62.87,0.00,-96.84:0.6386:61098
    

    And it is indeed correctly called by HaplotypeCaller3.2:

    zgrep -w 544040 out_HaplotypeCaller/20/NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.vcf.gz
     20 544040  rs17770500  C   T,<NON_REF> 48.77   .   BaseQRankSum=1.026;DB;DP=4;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.000;ReadPosRankSum=1.026   GT:AD:GQ:PL:SB  0/1:2,2,0:62:78,0,62,84,69,152:0,2,1,1
    

    However, the GT becomes ./., when I run CombineGVCFs3.2:

    20  544040  rs17770500  C   T,<NON_REF> .   .   BaseQRankSum=1.03;DP=4;MQ=60.00;MQ0=0;MQRankSum=0.00;ReadPosRankSum=1.03    GT:AD:PL:SB ./.:2,2,0:78,0,62,84,69,152:0,2,1,1
    

    And the variant at position 544040 disappears altogether after GenotypeGVCFs3.2. I have also checked in the vicinity (+/- 100bp).

    It's not a widespread problem, but I don't understand, why CombineGVCFs converts the GT from 0/1 to ./.. Is this a bug or a feature?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This sounds like a bug. We'll try to get @bgrenier's bug report processed in the very near future. Unfortunately we've been a little slow on bug fixes recently; the team is going through some re-organization that promises to lead to much awesomeness, but in the short term it has slowed us down a bit. Sorry about that!

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera‌ Was this bug fixed between 3.2 and 3.3? Thanks for the release.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I believe the fix for this made it into the 3.3 release. Will try to confirm later today.

  • bgrenierbgrenier FranceMember

    I've tested the GATK3-3 version and the problem still persists

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    I am sorry for the delay. The bug was a bigger issue than we first thought. I have spoken to the developer who fixed the bug, and the issue is now fixed in all formats, so the fix should be out soon.

    I will let you know which nightly build you can try again.

    Thanks for being patient.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bgrenier‌

    Hi,

    The bug should be fixed in tonight's nightly build.

    -Sheila

  • bgrenierbgrenier FranceMember

    @Sheila

    Hi,

    Thanks. I've tested the nightly build from 20141107 and it works on my little example (but I didn't have the time to test it on a bigger one).

    When do you think this bug correction will be be part of the main GATK release ?

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bgrenier I would expect the next release to happen no sooner than January 2015.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera said:
    I believe the fix for this made it into the 3.3 release. Will try to confirm later today.

    @Geraldine_VdAuwera‌ Did the fix make it into 3.3? I will test version 3.3 tomorrow, if you don't know. It was not available to me on our cluster the other day, when I had time to test it. As always thank you very much.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi,

    Unfortunately, the fix is not available in 3.3. The fix was a rather large one, so it only made it into the nightly build.

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited November 2014

    @Sheila said:
    Unfortunately, the fix is not available in 3.3. The fix was a rather large one, so it only made it into the nightly build.

    @Sheila thanks for your reply. I tried running the nightly build (nightly-2014-11-14-g58cfab1).

    Here is one of the bams used as my HaplotypeCaller input:

    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/alignment/NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
    

    Here is part of my HaplotypeCaller output (GT=0/1 and PL=78,0,62,84,69,152):

    20  544040  rs17770500  C   T,<NON_REF> 49.77   .   BaseQRankSum=1.026;DB;DP=4;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.000;ReadPosRankSum=1.026   GT:AD:GQ:PL:SB  0/1:2,2,0:62:78,0,62,84,69,152:0,2,1,1
    

    Here is part of my CombineGVCFs output (GT=./. and PL=78,0,62,84,69,152):

    20  544040  rs17770500  C   T,<NON_REF> .   .   BaseQRankSum=1.03;DP=754;MQ=60.00;MQ0=0;MQRankSum=0.00;ReadPosRankSum=1.03  GT:AD:DP:MIN_DP:PL:SB   ./.:2,2,0:.:.:78,0,62,84,69,152:0,2,1,1
    

    The PL field 78,0,62,84,69,152 supports GT 0/1, but it is converted to ./. by CombineGVCFs. Am I running CombineGVCFs incorrectly? My command line looks like this:

    $java7 -Djava.io.tmpdir=tmp -Xmx9900m \
     -jar ../build_nightly_2014-11-14/GenomeAnalysisTK.jar \
     --analysis_type CombineGVCFs \
     --reference_sequence $pathref/hs37d5.fa \
     -L $chrom \
     -V lists/CombineGVCFs.$chrom.$i.list \
     -o $out
    

    The HaplotypeCaller output for sample NA12878 is small enough (78MB), that I can upload it somewhere.

    Your feedback on this is much appreciated.

    Here is a highly confident call from high coverage data:

    ftp://ftp.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
    
    20  544040  .   C   T   4183.5  PASS    DP=869;DPR=0.998;PGC=5,2,2,2;PGDR=1.034,1.000,0.976,1.038;PHQ=44;source=NISTPASS,RTGPHQ;PHC GT:DP:DPR:RE:AR:RQ:GQ:ABP:SBP:RPB:PPB:PUR:RS:AD:GL:AVR:PS   0|1:55:1.092:0.082:0.000:628.7:629:1.93:1.45:0.36:0.28:0.02:C,31,0.046,T,24,0.036:31,24:-62.87,0.00,-96.84:0.6386:61098
    

    UnifiedGenotyper3.2 calls the variant correctly (PL=74,0,61):

    20  544040  rs17770500  C   T   21.88   .   AC=5;AF=1.278e-03;AN=3912;BaseQRankSum=-2.237;DB;DP=9310;Dels=0.00;FS=0.000;HaplotypeScore=0.2220;InbreedingCoeff=-0.0450;MLEAC=1;MLEAF=2.556e-04;MQ=59.12;MQ0=0;MQRankSum=-0.432;QD=1.04;ReadPosRankSum=1.567  GT:AD:DP:GQ:PL  0/1:2,2:4:61:74,0,61
    

    FreeBayes0.9.18 also calls it correctly (GL=-7.64903,0,-5.74903):

    20  544040  .   C   T   40.2227 .   AB=0.5;ABP=3.0103;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=4;DPB=4;DPRA=0;EPP=7.35324;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=4.87905;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=85;QR=65;RO=2;RPL=1;RPP=3.0103;RPPR=3.0103;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=0;SRP=7.35324;SRR=2;TYPE=snp;technology.ILLUMINA=1   GT:DP:RO:QR:AO:QA:GL    0/1:4:2:65:2:85:-7.64903,0,-5.74903
    

    And samtools1.1 calls the SNP correctly (PL=73,0,48):

    20  544040  .   C   T   14.4345 .   DP=9281;VDB=0.267186;SGB=-22.0731;RPB=0.277231;MQB=0.081293;MQSB=0.045318;BQB=0.176085;MQ0F=0;ICB=1;HOB=1.30553e-07;AC=1;AN=3914;DP4=4028,5037,6,3;MQ=48    GT:PL:DP:DV 0/1:73,0,48:4:2
    
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    It will probably not be part of the stable release, but I get this error message, when I try to run the 2014-11-14 nightly build of GenotypeGVCFs:

    ##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 20:3037872, T* vs. C*
    

    I am running GenotypeGVCFs on gVCFs that were created with the same reference panel and same nightly build of HaplotypeCaller and CombineGVCFs.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Well Tommy, you're scaring up a whole bunch of bugs for us these days :)

    Please do upload you test case file as described here, and @Sheila will process your bug for the devs to fix.

    I assume the inconsistent refs issue comes up with the same files involved in your Combine bug? Even if it's just in the nightly, I want to make sure it gets nipped in the bud.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera‌ Yeah, I guess I should watch my back, if I ever visit the GATK dev HQ. The GenotypeGVCFs bug does indeed come up with the same files involved in the CombineGVCFs bug, but I didn't have this issue with the stable 3.2 release following an almost identical procedure for the same data; only the VCF combination lists were different. That's why I assumed it was just an issue with the nightly build. I will upload a test case file as described. I will do this no later than Monday next week. The bug I am most interested in getting rid of is the one related to CombineGVCFs. For low coverage data (~4x) I'm currently seeing poor performance (sensitivity and false discovery rate) of HC relative to UG and other tools. I'll upload those test case files straight away!

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Just one final comment. I ran HC3.2 with -ERC NONE instead of -ERC GVCF and the variant was correctly called from the low coverage NA12878 bam (i.e. GT=0/1=C/T):

    20 544040 rs17770500 C T 27.50 . AC=3;AF=7.680e-04;AN=3906;BaseQRankSum=0.553;DB;DP=9060;FS=0.000;InbreedingCoeff=-0.0436;MLEAC=1;MLEAF=2.560e-04;MQ=59.42;MQ0=0;MQRankSum=1.705;QD=2.12;ReadPosRankSum=0.492 GT:AD:GQ:PL 0/1:2,2:62:78,0,62

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Nah, you're safe. You provide good test cases with minimal fuss, and the bug-squashing makes the tools more robust. It's all good :)

    Technically the nightlies are supposed to be as thoroughly tested as the official build, so any mature components (as opposed to experimental features) should be bug-free in the nightlies too, unless we have untested corner cases -- so we definitely want to know if something is wrong.

    Feedback on performance results is welcome too if you see any issues. We want to stay ahead of the pack :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Definitely looks like CombineGVCFs is still misbehaving.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited November 2014

    I couldn't replicate the GenotypeGVCFs error with just 2 samples, but I succeeded in the end with version 3.3 when using sample NA12878 and hundreds of samples from a cohort (more than 5x200 samples seemed to be the magic number). This is the name of the uploaded file containing a region of ~2000bp around the problematic coordinate 544040:

    bugreport.tc9.GenotypeGVCFs.tar.gz
    

    GenotypeGVCFs succeeds in converting the GT from ./. to 0/1, when I only run it on one sample. I therefore think GenotypeGVCFs is the culprit, when running multiple samples. I have however uploaded another file for replicating the error of the HaplotypeCaller output GT being converted from 0/1 to ./. by CombineGVCFs. This is the file name on the ftp server:

    bugreport.tc9.CombineGVCFs.tar.gz
    

    I have included a README.txt file in both gzipped tar files, which explains the folder contents and also links to this thread. Please let me know, if it is insufficient or unclear.

    I will follow up on this GenotypeGVCFs error message later:

    ##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 20:3037872, T* vs. C*
    

    Thank you.

  • bgrenierbgrenier FranceMember
    edited November 2014

    Hi,

    I've tested CombineGVCFs on a bigger example using the nighlty build of 20141107 and I've noticed no problem on the result file.
    Unfortunately, I've found the same error than tommycarstensen at the GenotypeGVCFs step running GATK3.3, GATK3.2-2 or the nightlybuild of 20141107.:
    ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:762304, G* vs. T*

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bgrenier, can you please check and tell us if the files actually have those inconsistencies? This is to determine which tool is bugging, either writing out the wrong allele or misreading the input.

  • vifehevifehe SpainMember

    Hi,

    I have encountered the same problem as @bgrenier and @tommycarstensen‌

    ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at chr1:801977, G* vs. T*

    I've checked the two files at this position and they do have these inconsistencies.

    file 1:
    chr1 801977 . T
    file 2:
    chr1 801977 . G

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hey folks (@vihefe, @bcgrenier, @tommycarstensen‌)

    This bug is really troubling, so we'd like to determine ASAP what's up. I need each of you to please tell me the following:

    1. What version(s) were the faulty GVCFs made with?
    2. Were they output by HC or by CombineGVCF?
    3. Is the issue (faulty GVCF generation) reproducible, meaning if you re-call that region of that file or re-combine GVCFs (depending on 2), does the wrong base get written again?
    4. Does the issue affect more than one GVCF file, meaning is the faulty base present in more than one file?
    5. Does the issue occur at multiple sites, meaning if you exclude the site in the first error with -XL in your GenotypeGVCFs command, do you get another case of the error?
    6. Anything else you think might be relevant?

    Thanks in advance for helping us figure this out!

  • bgrenierbgrenier FranceMember
    edited November 2014

    Hi @Geraldine_VdAuwera‌ ,

    After running some tests using the nighlty build of 20141107, I have some more information.

    Example : I have 2 individuals after HC in GVCF mode from one first batch and 2 individuals after HC in GVCF mode from another batch. The goal is to compare the genotype with and without a merging step (merging individuals per batch).

    1) Without any merging, the GenotypeGVCFs step works perfectly.

    2) With a merging step :

    2.1) The first batch, after merging, has a G as reference allele for the problematic position :
    1 762304 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./.:27:42:25:0,42,630 ./.:51:93:44:0,93,1395

    2.2) The second batch, after merging, has a T as reference allele for the problematic position
    1 762304 . T <NON_REF> . . DP=158 GT:AD:DP:MIN_DP:PGT:PID:PL:SB ./.:65,47:112:.:0|1:762303_GGTA_G:1847,0,2576:10,55,4,43 ./.:.:88:46:.:.:0,99,1485

    This sure leads to the error I gave on my previous comment at the GenotypeGVCFs step.

    3) I also noticed that one individual has a deletion in the second batch at the position 762303 (I've truncated the output so that it is more readable). The problematic position is "missing" (regarding the comment from Eric Banks) for that individual :
    1 762303 . GGTA G,<NON_REF> 1809.73 . ... GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:65,47,0:112:99:0|1:762303_GGTA_G:1847,0,2576,2043,2724,4767:10,55,4,43 1 762307 . G <NON_REF> . . END=762308 GT:DP:GQ:MIN_DP:PL 0/0:109:0:109:0,0,534
    It seems that this is this indel that leads to the bug.

    I have prepared an archive with this little example if you want to test it.

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @bgrenier,

    Thanks for the details -- this suggests to me that CombineGVCFs is at fault here. Please do upload your test case (instructions here as always) and hopefully we can get this fixed asap.

  • bgrenierbgrenier FranceMember

    @Geraldine_VdAuwera‌ ,

    I have uploaded the archive named bgrenier_combineGVCFs_20141120.tar.gz containing files, readme and command lines used.
    Please note that my previous archive was named bgrenier_combineGVCFs.tar.gz and doesn't represent the same bug.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Great, thanks! I deleted the old archive to avoid any confusion since we resolved that one. We need to do some cleaning...

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited November 2014

    @tommycarstensen‌

    Hi!

    I put in your bug report. Hopefully I can let you know it is fixed asap.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi again,

    This is not actually a bug. There are only 2 reads out of 4 in NA12878 that have the alt allele. None of the other samples do, many of which have coverage closer to 20 or 30. This is a case where a mediocre variant disappears when you add more samples, due to the lack of evidence for the variant. It also may not be a real variant because it's not in the PCR-free BAMs.

    However, you can recover the variant site by setting -stand_emit_conf and -stand_call_conf to 0.

    -Sheila

  • vifehevifehe SpainMember

    @Geraldine_VdAuwera‌

    I'll answer your questions below

    Hey folks (@vihefe, bcgrenier, tommycarstensen‌)

    This bug is really troubling, so we'd like to determine ASAP what's up. I need each of you to please tell me the following:

    1. What version(s) were the faulty GVCFs made with?

    1 - I'm using GATK-nightly_build_2014-11-17

    1. Were they output by HC or by CombineGVCF?

    2 - Output by CombineGVCFs

    1. Is the issue (faulty GVCF generation) reproducible, meaning if you re-call that region of that file or re-combine GVCFs (depending on 2), does the wrong base get written again?

    3 - yes, it is reproducible - it provides same error

    1. Does the issue affect more than one GVCF file, meaning is the faulty base present in more than one file?

    4 - Initially I had 2 files of 200vcf that when running Genotype reported the mentioned error. Then I re-splitted my data into 4 files of 100vcf - run GenotypeGVCFs and got the very same error referring to the same positions.

    1. Does the issue occur at multiple sites, meaning if you exclude the site in the first error with -XL in your GenotypeGVCFs command, do you get another case of the error?

    5 - when running Genotype GVCFs, I gor another case error, and when exclucing, first and second, got third error

    1. Anything else you think might be relevant?

    Thanks in advance for helping us figure this out!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks @vifehe, this is very helpful. We're prioritizing this bug fix since it's pretty bad. We'll keep you posted on the status.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited November 2014

    @Geraldine_VdAuwera said:
    1. What version(s) were the faulty GVCFs made with?
    2. Were they output by HC or by CombineGVCF?
    3. Is the issue (faulty GVCF generation) reproducible, meaning if you re-call that region of that file or re-combine GVCFs (depending on 2), does the wrong base get written again?
    4. Does the issue affect more than one GVCF file, meaning is the faulty base present in more than one file?
    5. Does the issue occur at multiple sites, meaning if you exclude the site in the first error with -XL in your GenotypeGVCFs command, do you get another case of the error?
    6. Anything else you think might be relevant?

    1) nightly-2014-11-14-g58cfab1

    2) I got the error after CombineGVCFs, which seems to have generated the faulty gVCFs. All HaplotypeCaller output with a variant at the position seem to contain the correct REF.

    3) Yes, I get the same error.

    4) I see the wrong REF in the 10th of 10 gVCFs generated by CombineGVCFs. Each containing 200 samples, the last/tenth fewer and also containing NA12878.

    5) Not tested. Please let me know, if I should test.

    6) A big thank you to you and Sheila for your light-in-vaccum fast replies.

    Only a few of ~2000 samples had a call at the problematic site. I don't know how to interpret two and three PL values of zero in the HaplotypeCaller output. Not sure if this is somehow related to the bug.

    20  3037872 .   T   <NON_REF>   .   .   END=3037873 GT:DP:GQ:MIN_DP:PL  0/0:3:0:3:0,0,42
    20  3037872 .   T   <NON_REF>   .   .   END=3037872 GT:DP:GQ:MIN_DP:PL  0/0:2:3:2:0,3,45
    20  3037872 .   T   <NON_REF>   .   .   END=3037876 GT:DP:GQ:MIN_DP:PL  0/0:7:21:7:0,21,235
    20  3037872 .   T   <NON_REF>   .   .   END=3037872 GT:DP:GQ:MIN_DP:PL  0/0:8:3:8:0,3,45
    20  3037872 .   T   <NON_REF>   .   .   END=3037879 GT:DP:GQ:MIN_DP:PL  0/0:2:0:2:0,0,0
    20  3037872 .   T   <NON_REF>   .   .   END=3037872 GT:DP:GQ:MIN_DP:PL  0/0:3:3:3:0,3,45
    20  3037872 .   T   <NON_REF>   .   .   END=3037874 GT:DP:GQ:MIN_DP:PL  0/0:3:3:3:0,3,45
    20  3037872 .   T   <NON_REF>   .   .   END=3037873 GT:DP:GQ:MIN_DP:PL  0/0:4:0:4:0,0,61
    20  3037872 .   T   <NON_REF>   .   .   END=3037872 GT:DP:GQ:MIN_DP:PL  0/0:2:0:2:0,0,19
    20  3037872 .   T   <NON_REF>   .   .   END=3037879 GT:DP:GQ:MIN_DP:PL  0/0:6:0:5:0,0,0
    
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila Thanks for looking into this so quickly.

    @Sheila said:
    This is not actually a bug. There are only 2 reads out of 4 in NA12878 that have the alt allele. None of the other samples do, many of which have coverage closer to 20 or 30. This is a case where a mediocre variant disappears when you add more samples, due to the lack of evidence for the variant. It also may not be a real variant because it's not in the PCR-free BAMs.

    2 out of 4 sounds exactly heterozygous, no? I'll check the PCR-free high coverage BAM. This bug/feature suggests to me, that I should avoid running samples from different cohorts together; e.g. thousands of African samples with the CEU sample NA12878. I will see if I can do a similar evaluation of variant callers with the YRI sample NA19240 instead. All other callers I have tried call this variant (in)correctly; i.e. heterozygous. Would changing -hets from the default 0.001 make any difference?

    @Sheila said:
    However, you can recover the variant site by setting -stand_emit_conf and -stand_call_conf to 0.

    Here it is mentioned that those "filters are not applied in GVCF mode", which I'm guessing refers only to HaplotypeCaller:

    http://gatkforums.broadinstitute.org/discussion/4269/stand-emit-conf-and-stand-call-conf-setting-in-haplotypecaller
    

    I applied the filters to GenotypeGVCFs instead of HaplotypeCaller. The default is 30. If I use a value of 0, then nothing is emitted (output VCF has zero size), whereas variants are emitted, when I use a value of 1. I suspect this is a bug, no? Those two parameters are floats/doubles according to the documentation, so I will set them to 0.0000001 for the time being. I will send you the NA12878 ROC curves once GenotypeGVCFs, VariantRecalibrator, etc. runs to completion to confirm whether lowering the emit/call thresholds solves the problem and makes HaplotypeCaller in GVCF mode perform as well as existing tools. I hope VariantRecalibrator is able to handle the additional junk, which I'll be feeding it now. Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @vifehe‌ @tommycarstensen‌ @bgrenier‌

    Hi everyone,

    The bug for the inconsistent reference has been submitted. I hope this will be fixed asap. I will keep you updated.

    Thanks @bgrenier for submitting the simple test files.

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @tommycarstensen said:
    Sheila Thanks for looking into this so quickly.

    I applied the filters to GenotypeGVCFs instead of HaplotypeCaller. The default is 30. If I use a value of 0, then nothing is emitted (output VCF has zero size), whereas variants are emitted, when I use a value of 1. I suspect this is a bug, no?

    @Sheila scrap that. Output was written after job completion. It must have been written to a temporary file, while the job was running. Not the usual behaviour, but all is fine.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila said:
    The bug for the inconsistent reference has been submitted. I hope this will be fixed asap. I will keep you updated.

    Has Christmas come early this year? Has this bug been fixed in the latest nightly build? Or is it a difficult one to fix?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, we just got back from Thanksgiving, so it's still a little early for Christmas, sorry. I don't expect it will be particularly difficult, it's just bad timing, between the US holiday schedule and a couple of priority items in the queue that need to be resolved -- but the ref bug will be next.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @juber @bgrenier @udp‌ @tommycarstensen‌ @vifehe‌

    Hi everyone,

    The issue of inconsistent references for the same position is now fixed. It will be available in tonight's nightly build: https://www.broadinstitute.org/gatk/nightly

    -Sheila

  • OprahOprah Member

    In summary, was this a bug with CombineGVCFs, or GenotypeGVCFs, or both?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Oprah‌

    Hi,

    It was in CombineGVCFs.

    -Sheila

  • bgrenierbgrenier FranceMember

    @Sheila,

    Thanks, I've tested the nighlty build 20141217 and it works on my little example. I will test it on a larger example.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila said:
    The issue of inconsistent references for the same position is now fixed. It will be available in tonight's nightly build: https://www.broadinstitute.org/gatk/nightly

    Thank you! Initial analysis points to all bugs having been fixed. Very happy! Very grateful for your quick turn-around!

    @Sheila said:
    This is not actually a bug. There are only 2 reads out of 4 in NA12878 that have the alt allele. None of the other samples do, many of which have coverage closer to 20 or 30. This is a case where a mediocre variant disappears when you add more samples, due to the lack of evidence for the variant. It also may not be a real variant because it's not in the PCR-free BAMs.

    This position (20:544040) and another coordinate (20:2532777) with a DP of 11 are both heterozygous in NA12878. They are now correctly called in GVCF mode along with ~2000 non-European samples with QUALs of 25.92 and 193755.52 and AFs of 2.581e-04 and 0.781, respectively.

    20  544040  .   C   T   25.92   .   AC=1;AF=2.581e-04;AN=3874;BaseQRankSum=1.03;DP=8604;GQ_MEAN=12.32;GQ_STDDEV=6.73;InbreedingCoeff=-0.0542;MLEAC=1;MLEAF=2.581e-04;MQ=60.00;MQ0=0;MQRankSum=0.00;NCC=50;QD=6.48;ReadPosRankSum=1.03;SOR=0.105 GT:AD:DP:GQ:PL  0/1:2,2:.:62:78,0,62
    
    20  2532777 .   A   G   193755.52   .   AC=2640;AF=0.781;AN=3380;BaseQRankSum=0.322;DP=7945;GQ_MEAN=17.34;GQ_STDDEV=18.76;InbreedingCoeff=0.2608;MLEAC=2795;MLEAF=0.827;MQ=60.00;MQ0=0;MQRankSum=0.406;NCC=297;QD=29.62;ReadPosRankSum=0.358;SOR=0.727  GT:AD:DP:GQ:PGT:PID:PL  0/1:9,2:.:39:.:.:39,0,305
    

    Tomorrow I will check, that HaplotypeCaller in GVCF mode at a QUAL threshold between 0 and 30 performs as well for SNPs called from low coverage as UnifiedGenotyper, samtools, FreeBayes, Platypus and HaplotypeCaller in non-GVCF mode. I will come back with tears in my eyes, if this is not the case. But I think I will stay happy and satisfied until next year and leave you alone until then. Thanks!

    P.S. I noticed PGT and PID being added to the INFO field. It is explained elsewhere how to get rid of them:

    http://gatkforums.broadinstitute.org/discussion/4682/pgt-and-pid-in-vcf
    

    P.P.S. CombineGVCFs is a wee bit slow and can't be threaded. But that's a different story for another thread for another time...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    For my part, I hope it will be not tears, but stars in your eyes from HC-GVCF's wondrous performance! (or tears of joy, but maybe that's a little over the top)

    Happy holidays and good riddance to you ;)

    P.S. Y u no want physical phasing?

    P.P.S. CombineGVCFs is a wee bit of a pain in the arrr ahem hurrm. No further comments, Your Honor.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    I took a break from building my snowman and I am now disturbing the Christmas peace for which Santa will surely punish me in the afterlife. The issue at coordinate 20:544040 was sorted, but I came across a lot of incorrect calls by HC in GVCF mode after the bug fix; i.e. nightly-2014-12-17-g582f471. I am calling variants from this bam file:

    ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
    

    I am comparing my calls to this file:

    ftp://ftp.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/GIAB_integration/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz
    

    FreeBayes, samtools, UG and HC in normal multisample mode all make correct calls at two selected positions; 20:2026068:C:T:0/1 and 20:58262178:A:G:1/1. But HaplotypeCaller in GVCF mode fails for these two positions and a few thousand other sites on chromosome 20. Here is a GT comparison of HC in GVCF mode to UG:
    count UG HC
    6752 0/1 0/0
    1694 1/1 ./.
    1602 0/1 ./.
    177 0/1 1/1

    Here the calls at the two positions:

    20:2026068
    

    GiaB:

    20  2026068 .   C   T   4769.3  PASS    DP=931;DPR=1.069;PGC=3,2,4,2;PGDR=0.977,1.028,1.067,1.175;PHQ=44;source=NISTPASS,RTGPHQ;PHC GT:DP:DPR:RE:AR:RQ:GQ:ABP:SBP:RPB:PPB:PUR:RS:AD:GL:AVR:PS   0|1:55:1.092:0.098:0:555.8:556:4.78:1.65:1.93:0.02:0:C,33,0.048,T,22,0.050:33,22:-55.58,0,-104.3:0.6372:61098
    

    HC in GVCF mode:

    20  2026068 .   C   <NON_REF>   .   .   END=2026068 GT:DP:GQ:MIN_DP:PL  0/0:9:0:9:0,0,287
    

    CombineGVCFs:

    20  2026068 rs6136826   C   T,<NON_REF> .   .   BaseQRankSum=-4.060e-01;DP=840;MQ=60.00;MQ0=0;MQRankSum=0.720;ReadPosRankSum=0.406  GT:AD:DP:MIN_DP:PL:SB   ./.:.:9:9:0,0,287,0,287,287
    

    GenotypeGVCFs:

    20  2026068 .   C   T   17932.67    .   AC=256;AF=0.066;AN=3868;BaseQRankSum=-5.500e-01;DP=8490;GQ_MEAN=15.86;GQ_STDDEV=18.12;InbreedingCoeff=0.0660;MLEAC=264;MLEAF=0.068;MQ=60.00;MQ0=0;MQRankSum=0.358;NCC=53;QD=16.11;ReadPosRankSum=0.406;SOR=0.615    GT:AD:DP:GQ:PL  0/0:9,0:9:0:0,0,287
    

    samtools:

    20  2026068 rs6136826   C   T   999 .   AC=299;AF=0.077;AN=3902;BQB=0;DB;DP=8964;DP4=3812,4125,356,370;FS=0.000;HOB=0.00559282;HaplotypeScore=0.1706;ICB=0.777892;InbreedingCoeff=0.0673;MQ=59.32;MQ0F=0;MQB=0;MQRankSum=0.752;MQSB=0.906879;NCC=36;QD=0.70;RPB=0.81009;ReadPosRankSum=0.465;SGB=379.929;VDB=0.592826   GT:DP:DV:PL 0/1:9:1:15,0,205
    

    FB:

    20  2026068 .   C   T   14563.6 .   AB=0.511329;ABP=4.48638;AC=306;AF=0.078341;AN=3906;AO=768;CIGAR=1X;DP=8963;DPB=8963;DPRA=1.0685;EPP=4.92165;EPPR=68.334;GTI=46;LEN=1;MEANALT=1.01003;MQM=58.9414;MQMR=59.1423;NS=1953;NUMALT=1;ODDS=0.012365;PAIRED=0.996094;PAIREDR=0.995598;PAO=0;PQA=0;PQR=0;PRO=0;QA=24285;QR=274384;RO=8178;RPL=368;RPP=5.9056;RPPR=25.0341;RPR=400;RUN=1;SAF=369;SAP=5.55499;SAR=399;SRF=3902;SRP=40.151;SRR=4276;TYPE=snp;technology.ILLUMINA=1  GT:DP:RO:QR:AO:QA:GL    0/1:9:8:296:1:42:-2.44497,0,-25.255
    

    UG:

    20  2026068 rs6136826   C   T   18722.8 .   AC=350;AF=0.09;AN=3904;BaseQRankSum=-16.847;DB;DP=8936;Dels=0;FS=0;HaplotypeScore=0.1702;InbreedingCoeff=0.0684;MLEAC=321;MLEAF=0.082;MQ=59.38;MQ0=0;MQRankSum=-0.903;QD=13.09;ReadPosRankSum=0.372 GT:AD:DP:GQ:PL  0/1:8,1:9:16:16,0,289
    

    HC normal multisample mode:

    20  2026068 rs6136826   C   T   19358.1 .   AC=347;AF=0.089;AN=3898;BaseQRankSum=-15.889;DB;DP=8746;FS=0;InbreedingCoeff=0.0747;MLEAC=338;MLEAF=0.087;MQ=59.57;MQ0=0;MQRankSum=0.134;QD=13.9;ReadPosRankSum=3.605   GT:AD:GQ:PL0/1:8,1:18:18,0,303
    
    20:58262175
    

    GiaB:

    20  58262175    .   GATA    GATG    25847.4 PASS    DP=834;DPR=0.957;PGC=2,2,5,2;PGDR=1.004,1.033,0.978,0.943;PHQ=44;source=NISTPASS,RTGPHQ;PHC;XRX GT:DP:DPR:RE:AR:RQ:GQ:ABP:SBP:RPB:PPB:PUR:AD:GL:AVR 1/1:48:0.953:0.21:0:1630.6:249:0.18:0.42:2.09:0:0:0,47,0:-163.06,-25.96,0,-169.41,-25.09,-293.73:0.3751
    

    HC in GVCF mode:

    20  58262178    .   A   <NON_REF>   .   .   END=58262178    GT:DP:GQ:MIN_DP:PL0/0:7:0:7:0,0,0
    

    CombineGVCFs:

    20  58262178    rs6015549   A   G,<NON_REF> .   .   DP=941;MQ=60;MQ0=0  GT:AD:DP:MIN_DP:PL:SB   ./.:.:7:7:0,0,0,0,0,0:.
    

    GenotypeGVCFs:

    20  58262178    .   A   G   361592  .   AC=3249;AF=0.976;AN=3328;BaseQRankSum=1.75;DP=10422;GQ_MEAN=17.94;GQ_STDDEV=11.85;InbreedingCoeff=0.2607;MLEAC=3304;MLEAF=0.993;MQ=60;MQ0=0;MQRankSum=0.72;NCC=323;QD=24.95;ReadPosRankSum=0;SOR=0.736  GT:AD:DP:GQ:PGT:PID:PL  ./.:7,0:7:.:.:.:.
    

    samtools:

    20  58262178    rs6015549   A   G   999 .   AC=3890;AF=1;AN=3890;BQB=1;DB;DP=10778;DP4=0,1,4569,5113;FS=0;HaplotypeScore=0.6135;InbreedingCoeff=-0.0392;MQ=58.29;MQ0F=0;MQB=1;MQRankSum=2.235;MQSB=7.12528e-08;NCC=42;QD=0.09;RPB=1;ReadPosRankSum=0.063;SGB=-581.615;VDB=0.467269  GT:DP:DV:PL 1/1:7:7:152,21,0
    

    FB:

    20  58262174    .   TGATA   TGATG   313875  .   AB=0;ABP=0;AC=3898;AF=1;AN=3898;AO=9915;CIGAR=4M1X;DP=10057;DPB=10489;DPRA=0;EPP=3.27859;EPPR=5.18177;GTI=10;LEN=1;MEANALT=1.07081;MQM=57.9754;MQMR=60;NS=1949;NUMALT=1;ODDS=169.748;PAIRED=0.994856;PAIREDR=1;PAO=429.5;PQA=12653.5;PQR=6270.5;PRO=219.5;QA=341886;QR=34;RO=1;RPL=4955;RPP=3.01578;RPPR=5.18177;RPR=4960;RUN=1;SAF=4522;SAP=169.159;SAR=5393;SRF=0;SRP=5.18177;SRR=1;TYPE=snp;technology.ILLUMINA=1    GT:DP:RO:QR:AO:QA:GL    1/1:7:0:0:4:134:-12.16,-1.20412,0
    

    UG:

    20  58262178    rs6015549   A   G   375079  .   AC=3895;AF=1;AN=3896;BaseQRankSum=2.809;DB;DP=10745;Dels=0.03;FS=0;HaplotypeScore=0.5316;InbreedingCoeff=-0.0364;MLEAC=3896;MLEAF=1;MQ=58.34;MQ0=0;MQRankSum=2.116;QD=30.54;ReadPosRankSum=0.072    GT:AD:DP:GQ:PL  1/1:0,7:7:21:245,21,0
    

    @tommycarstensen said:
    I don't know how to interpret two and three PL values of zero in the HaplotypeCaller output. Not sure if this is somehow related to the bug.

    I notice the variant that I expect to be 0/1 has a PL of 0,0,287 after HC and the variants I expect to be 1/1 has a PL of 0,0,0 after HC.

    @Geraldine_VdAuwera said:
    P.S. Y u no want physical phasing?

    The PGT and PID annotations just seem to be missing most of the time. I'm very minimalist and feng shui when it comes to decorating the interior of my VCF files :)

    Happy 2015 to you at the Broad and sorry for starting it this way.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi,

    Can you post IGV screenshots of the bamout files for the two positions. It looks like there is not much evidence for the first call (8 reference alleles and only 1 alternate allele). For the second call, I did not get the Haplotype Caller normal mode output, but I wonder if the realignment moved the reads.

    Thanks,
    Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited December 2014

    @Sheila, thanks for your unexpected reply.

    Here is the HC -ERC NONE multisample output:

    20  58262178    rs6015549   A   G   366360  .   AC=3803;AF=0.97;AN=3922;BaseQRankSum=0.951;DB;DP=10702;FS=0;InbreedingCoeff=0.059;MLEAC=3810;MLEAF=0.971;MQ=58.71;MQ0=0;MQRankSum=1.605;QD=34.43;ReadPosRankSum=-0.274  GT:AD:GQ:PL 1/1:0,6:18:216,18,0
    

    I can't produce the IGV screenshots until 2015, but I ran HC with -ERC BP_RESOLUTION from position i-1000 to i+1000 on chrom20, where i is 58262178. The AD definitely supports GT=1/1, but the GT is 0/0.

    zcat 58262178.BP_RESOLUTION.vcf.gz | grep 58262178
    20  58262178    .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:0,7:7:0:0,0,0
    

    And here is a samtools tview "screenshot" of the HC-realigned bam after running HC with -bamout, --bamWriterType ALL_POSSIBLE_HAPLOTYPES, --disableOptimizations, -forceActive and -dontTrimActiveRegions.

       58262181  58262191  58262201  58262211  58262221  58262231  58262241
    AATGATCATGGTGGTGACAGTGATGATGACGATGGTGATTGTGATGGTGGTGGTGATAGTGGTGATGACGATGATGATCA
    R...............................................................................
    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    G.........    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    G..........                                    .................................
    G.............                                                          ,,,,,,,,
    g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    G..................................
    

    The base quality is good and all reads have good mapping quality (i.e. greater than 30).

    20:58262178 is an example of a homALT site called as homREF. I'll try to find a better het site example other than 20:2026068.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila, I found a good example of a heterozygous site being called as homREF by HaplotypeCaller in GVCF mode. 20:2405838:A:T:0/1 is the site.

    Here the correct call by HC in normal multisample mode (GT=0/1, AD=4,3):

    20      2405838 rs8126371       A       T       84985.1 .       AC=1311;AF=0.34;AN=3852;BaseQRankSum=48.837;DB;DP=8142;FS=0.549;InbreedingCoeff=0.085;MLEAC=1317;MLEAF=0.342;MQ=59.38;MQ0=0;MQRankSum=-1.146;QD=19.47;ReadPosRankSum=0.469      GT:AD:GQ:PL     0/1:4,3:90:90,0,125
    

    Here the incorrect call by HC in GVCF mode (GT=0/0, AD=8,0):

    20  2405838 .   A   T   72615.3 .   AC=894;AF=0.268;AN=3342;BaseQRankSum=1.03;DP=6848;FS=1.893;GQ_MEAN=19.06;GQ_STDDEV=25.13;InbreedingCoeff=0.1361;MLEAC=1056;MLEAF=0.316;MQ=60;MQ0=0;MQRankSum=0.358;NCC=316;QD=22.57;ReadPosRankSum=0.406    GT:AD:DP:GQ:PL  0/0:8,0:8:0:0,0,72
    

    Here UG in multisample mode (0/1):

    20  2405838 rs8126371   A   T   81450.9 .   AC=1312;AF=0.34;AN=3856;BaseQRankSum=51.013;DB;DP=8235;Dels=0;FS=1.16;HaplotypeScore=0.132;InbreedingCoeff=0.0826;MLEAC=1245;MLEAF=0.323;MQ=59.21;MQ0=0;MQRankSum=-0.557;QD=18.46;ReadPosRankSum=0.532  GT:AD:DP:GQ:PL  0/1:5,3:8:88:88,0,159
    

    Here GiaB (1|0):

    20  2405838 .   A   T   6918.6  PASS    DP=845;DPR=0.97;PGC=3,2,4,2;PGDR=1.159,0.815,0.922,1.122;PHQ=44;source=NISTPASS,RTGPHQ;PHC  GT:DP:DPR:RE:AR:RQ:GQ:ABP:SBP:RPB:PPB:PUR:RS:AD:GL:AVR:PS   1|0:43:0.854:0.061:0:536.3:536:0.45:7.65:0.05:0.12:0:A,23,0.033,T,20,0.028:23,20:-53.63,0,-74.62:0.6404:61098
    

    Here samtools (0/1):

    20  2405838 .   A   T   999 .   DP=8204;VDB=0.832344;SGB=326.277;RPB=0.00450153;MQB=0;MQSB=0.991014;BQB=0;MQ0F=0;ICB=0.288722;HOB=0.0177483;AC=1160;AN=3846;DP4=2837,2507,1293,1206;MQ=47   GT:PL:DP:DV 0/1:5,0,125:6:1
    

    Here FreeBayes (0/1):

    20  2405838 .   A   T   67276.7 .   AB=0.491669;ABP=5.00021;AC=1270;AF=0.329016;AN=3860;AO=2765;CIGAR=1X;DP=8283;DPB=8283;DPRA=1.09695;EPP=171.363;EPPR=334.764;GTI=43;LEN=1;MEANALT=1.00201;MQM=58.7443;MQMR=58.833;NS=1930;NUMALT=1;ODDS=0.00414513;PAIRED=0.992405;PAIREDR=0.992386;PAO=0;PQA=0;PQR=0;PRO=0;QA=92660;QR=168521;RO=5516;RPL=1402;RPP=4.20481;RPPR=4.83062;RPR=1363;RUN=1;SAF=1394;SAP=3.42575;SAR=1371;SRF=2893;SRP=31.7087;SRR=2623;TYPE=snp;technology.ILLUMINA=1   GT:DP:RO:QR:AO:QA:GL    0/1:8:5:163:3:101:-8.76661,0,-14.3359
    

    Here HC in BP_RESOLUTION mode:

    20  2405838 .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:5,3:8:0:0,0,72
    
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi,

    I am wondering why there is a difference in AD between GVCF mode and BP_RESOLUTION mode. I will wait to see the IGV screenshots before asking for a snippet.

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited January 2015

    @Sheila, I did notice the AD difference between GVCF and BP_RESOLUTION. I assumed it was due to different local realignment being carried out (unintentionally). I have attached the GVCF and BP_RESOLUTION IGV screenshots of coordinate 20: 2405838. I'm not sure how to color by mapping quality and base quality. I don't quite understand, why the DP in the output VCF is so much lower than what the IGV screenshot shows me in the output BAM. I have also attached all of the output VCFs and BAMs after HC realignment.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    I really want to use HC for a large project, which I'm working on, but I was told today, that I only have until Monday next week to fix the problems with HC, which causes it to be inferior to other methods in GVCF mode for low coverage WGS data. That's most likely not going to happen. I will therefore be forced to switch to either UG or samtools. In this context I have two questions, which I'm happy to ask in a different thread, if they don't belong here. Will UG still be part of GATK3.4? Will 3.4 have cram support? Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi,

    Thanks for the files. I just checked both positions, and I think I see what is going on.

    Position 58262178 has only 5 reads that have the alternate allele G. Although all 5 reads support the alternate allele, 4 are on the forward strand and 1 is on the reverse strand. 5 reads is not a lot of evidence by itself, but also the strand bias causes Haplotype Caller to not consider this a real SNP.

    Position 2405838 has only 6 reads, and only 2 of the 6 reads contain the alternate allele. This is not enough evidence for Haplotype Caller to call an alternate allele there, not to mention both are on the reverse strand.

    Please note the double or triple zero PLs show that the caller is essentially declaring its inability to decide on a call due to the very low/biased coverage. In such a case, the interpretation is not that the site is invariant, but that it is impossible to decide without more/better data.

    The big thing to understand here is that BP_RESOLUTION and GVCF are intermediate files which will not have the same call as multisample output. Even if 1 sample does not have enough evidence for the call, when other samples have the same alternate allele, that adds evidence which will allow the call to be made in multisample output. The point of joint analysis is to add evidence from other samples that may have been missed in 1 sample alone, and that intermediate GVCF and BP_RESOLUTION are not the end result.

    I am not sure if you can color by mapping quality, but you can choose a mapping quality threshold by going to View > Preferences > Alignments. Note there is a box to check for flagging zero quality alignments, so mapping quality zero alignments will be shaded white.

    You can color by base quality by right clicking and choosing Shade Base By Quality. This page may help: http://www.broadinstitute.org/software/igv/Preferences#Alignments

    The DP in the output vcf is so much lower than what you see in IGV because the bamout file shows the artificial haplotypes created by Haplotype Caller. The artificial haplotypes are the most likely haplotypes chosen after graph reassembly. https://www.broadinstitute.org/gatk/guide/article?id=4146
    You can color the sample reads and the artificial haplotypes by right clicking, choosing Color Alignments By > Sample.

    -Sheila

    P.S. I hope this long answer is helpful! Also, it seems like a lot of users are confused about joint calling and how it affects output. If I can, I would like to use your data to make a document to explain joint calling better. Please let me know if I can use your data. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @tommycarstensen Do I understand correctly that what you're seeing is decreased sensitivity? And can you confirm that this is measured after joint genotyping with the rest of your cohort, rather than on HC's raw GVCF output?

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited January 2015

    @Geraldine_VdAuwera said:
    tommycarstensen Do I understand correctly that what you're seeing is decreased sensitivity?

    Yes, I'm afraid that's exactly what I'm seeing. See the attached plot. I'm missing ~7k of 63k heterozygous sites on chromosome 20, if I remember correctly. Sensitivity is better for UG, FreeBayes, samtools and HC in normal mode.

    And can you confirm that this is measured after joint genotyping with the rest of your cohort, rather than on HC's raw GVCF output?

    Yes, this is after HC in GVCF mode and after GenotypeGVCFs. I have even tried setting -stand_call_conf and -stand_emit_conf to zero.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, I'm going to discuss this with the devs this afternoon, because this is something we want to address. I doubt that we can resolve this by Monday, but maybe I'm missing something. And I'll get you answers to the CRAM question (UG will still be in 3.4).

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila said:

    If I can, I would like to use your data to make a document to explain joint calling better. Please let me know if I can use your data. Thanks!

    All of the data is 1000G NA12878, so you can share all of it. I don't think I shared any of the African data with you. If I did, then I need to ask for permission. I don't expect it to be a problem for a few thousand base pairs.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila said:

    Also, it seems like a lot of users are confused about joint calling and how it affects output. If I can, I would like to use your data to make a document to explain joint calling better. Please let me know if I can use your data. Thanks!

    I just got the green light. Go ahead and use everything I have uploaded to your ftp.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited January 2015

    @Sheila said:

    I hope this long answer is helpful!

    It is. Also thanks for explaining the double and triple zero PLs and IGV to me.

    @Sheila said:

    [...] but also the strand bias causes Haplotype Caller to not consider this a real SNP.

    I thought only VariantRecalibrator took into consideration strand bias via FisherStrand and the new annotation, which I forgot the name of. If strand bias is the HC culprit, then it would be great to have an argument to relax this parameter.

    @Sheila said:

    The big thing to understand here is that BP_RESOLUTION and GVCF are intermediate files which will not have the same call as multisample output. Even if 1 sample does not have enough evidence for the call, when other samples have the same alternate allele, that adds evidence which will allow the call to be made in multisample output. The point of joint analysis is to add evidence from other samples that may have been missed in 1 sample alone, and that intermediate GVCF and BP_RESOLUTION are not the end result.

    I'm thinking of calling sample NA12878 along with the rest of the 1000G CEU samples (instead of 1986 African samples) to prove that HaplotypeCaller in GVCF mode does well, when your cohort is somewhat homogenous. Kindly let me know, if you have already shown the sensitivity of HC in GVCF mode to be equal to or better than UG for low coverage (4x-8x) data. My problem would however still be that I can't choose a reasonable VQSR TruthSensitivity threshold, if I don't call along with sample NA12878 for which I have a highly curated truth set to evaluate the ROC curve with. Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I thought only VariantRecalibrator took into consideration strand bias via FisherStrand

    You were correct, HC itself does not look at strand bias when assessing variants. Somehow we got ourselves confused about that, but we've sorted it out with the devs -- strand bias only comes into the picture at the filtering / recalibration stage.

    sensitivity of HC in GVCF mode to be equal to or better than UG for low coverage (4x-8x) data

    This is where the rub is; all our analyses are done on 30x coverage data, which is what the Broad platform generates in production by default. I had not realized until my meeting with the devs today that we had this implicit condition. My bad for not getting a better definition of what the supported bounds are. I'm sorry this has caused you so much trouble!

    I'm now putting in a request for an analysis of what sensitivity can be achieved depending on depth, along with a feature request to enable users to dial up sensitivity at low depths. None of this will be done by your deadline of course; sorry about that.

    CRAM work is in progress but I can't give you an ETA at all. That said, a 3.4 release is a while down the road still, so I'd hazard a guess that it will have cram support when it does eventually come out.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera said:
    (UG will still be in 3.4).

    @Geraldine_VdAuwera said:
    a 3.4 release is a while down the road still, so I'd hazard a guess that it will have cram support when it does eventually come out.

    Thanks! My concern was mainly that I'll be forced to switch to store cram and you will abandon UG before GATK has cram support. Worst case I can convert from cram to bam when needed.

    @Geraldine_VdAuwera said:
    I'm now putting in a request for an analysis of what sensitivity can be achieved depending on depth, along with a feature request to enable users to dial up sensitivity at low depths.

    Thanks. Very cool. Especially the latter. Have a nice weekend.

    I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

    bcftools norm -m +any
    

    I remember from another thread that UG treats SNPs and InDels separately:

    http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf
    

    But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

    I only had a minor issue despite tabix indexing my bgzipped known alleles file:

    ##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @tommycarstensen Can you please post this latest as a new question, since it's a rather different topic? Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hello,

    I am going to submit a bug report for position 58262178, because there are 5 reads all containing the alternate allele, but the artificial haplotype does not contain the alternate allele. Can you please submit a snippet of the original bam file around that position? Thanks!

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Sheila said:
    I am going to submit a bug report for position 58262178, because there are 5 reads all containing the alternate allele, but the artificial haplotype does not contain the alternate allele. Can you please submit a snippet of the original bam file around that position?

    Thanks! The original bam is publicly available:

    ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
    
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen‌

    Hi,

    I have put in the bug report for position 58262178, where there are 10 alternate reads that are not getting called. However, you reported that the site gets called when other samples are added. Can you please upload some GVCF snippets from other samples that make the allele get called in GenotypeGVCFs? It will help me make the case stronger that this is indeed a bug.

    Thanks!

    -Sheila

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited January 2015

    @Sheila‌ thanks for following up on this.

    I was never able to call the NA12878 variant at coordinate 20:58262178 in GVCF mode. It only worked with -ERC NONE in combination with 2000 African samples. See this comment. Instead of having to share all of my bam files I tried to call NA12878 along with the other 1000G CEU samples:

    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/*/alignment/*.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
    

    But that didn't call the variant because it is homALT in all 1000G populations according to Ensembl.

    Only when I call 18 homREF samples in my cohort with all the homALT CEU samples am I able to call the variant. I will generate bam snippets for those 18 samples from my cohort and upload. The CEU bams are from 1000G. The ftp also holds chromosome 20 bams, if you don't have all of the 1000G bams on your file system. I think they have been mapped. Yes, they have:

    samtools view -H ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam | grep -m1 bwa
    @PG ID:bwa_index    PN:bwa  VN:0.5.9-r16    CL:bwa index -a bwtsw $reference_fasta
    

    GATK PrintReads was much faster at generating the bam snippet than samtools. Weird. Probably a user error. I uploaded it as

    tommycarstensen_18Africans_4x_20_58261678-58262678.bam
    

    along with the index file. If you call it with the 1000G CEU bams, then you will be able to call the variant with HC3.3 -ERC NONE.

    Let me know, if I can contribute with more to have this enhancement bug fixed for version 3.4. Thank you as always.

    Now, what I haven't checked is that HC in GVCF mode fails to call the variant with this sample setup... Given that PrintReads is so fast I will create a bam snippet for the 2000 African samples and NA12878 if you prefer that. It's easier to deal with one file for you. Need to get some sleep first though. Sleep is for the weak and it's really easy and fast to generate the bam snippet, so I uploaded the file as:

    tommycarstensen_NA12878_and_1986Africans_4x_chrom20_58261678-58262678.bam
    
    Post edited by tommycarstensen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @tommycarstensen,

    The devs finally took a closer look at your bug report (with the 5 alt reads). Here's the detailed diagnosis:

    This is a pretty repetitive region and the depth over that site is pretty low. In the pileup from the BWA mapping there are only 5 reads with the alt. 4 of them can be realigned to other parts of the graph and then the last one gets pruned out of the graph. After pruning there's only the reference haplotype, so no variation. If we force genotyping with -disableOptimizations then we can get a bamout, but the reads revert to their original alignments because there's still only the reference haplotype. When we genotype at the possible SNP, we're genotyping against the reference, so the likelihoods favor the non_ref. However, there's a capByHomRefLikelihood function that gets called, so everything gets set to zero because the HomRef likelihood is the lowest.

    In a nutshell, this is the equivalent of a gallic shrug and a noncommittal "meh!", concluding that the site is just not callable with low coverage. There's nothing we can do about it at present and no obvious tweaks that would help. Sorry... (sad face)

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Well that sucks @Geraldine_VdAuwera, but no despair :) Hopefully this is the last time I will be working with low coverage data. I will take UG for one last dance instead. Thanks for looking into this so thoroughly and for not abandoning the question!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @tommycarstensen
    Hi,

    A quick point to note. Whenever you have low coverage, it is a good idea to try setting -minPruning 0. This will allow the graph to recover any pruned edges. Of course, whenever you make the graph building include more evidence, it has the potential to increase runtime. So, it is best to only do this in regions of low coverage. I tested it out on your data, and with -minPruning 0, the variant is called.

    -Sheila

  • weiqiangweiqiang Member

    @Sheila

    Hi Sheila,
    Using the GATK3.3.0, I have the same problem about the missing variants, although they are good coverage for these variants. There may be a problem about the g.vcf **files generated by **HaplotypeCaller.

    I use the 2015-03-30-g956b06d version from https://www.broadinstitute.org/gatk/nightly, but there are still the same issue. Now I want to know which version can fix the issue.

    Thank you very much.

    Qiang

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @weiqiang
    Hi Qiang,

    Can you explain the issue more in depth? Is there one particular sample out of many that has a variant that is not getting called, but the other samples do call it? Can you also post IGV screenshots of the input bam file and the bamout file?

    Thanks,
    Sheila

  • weiqiangweiqiang Member
    edited March 2015

    @Sheila
    Yes, variants in some samples can be called, while the same variants in other samples cannot be called.

    I simulated some fastq files based on a vcf file with the known variants. So I know the true variants in these positions. When I use the GATK3.3.0 to call these variants, I find that some true variants are not calling. Then I check the process and find g.vcf file is wrong in the positions like what Juber said.

    Juber said "For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf:"

    But when I use the UnifiedGenotyper of GATK2.5 to call these variants, these variants can be found. So I think there may be a big bug in getting the g.vcf file using HaplotypeCaller of GATK3.3.0 or GATK3.2.0.

    My datasets are 30x coverage,
    For example, the position of one variant is 245916873 of chromosome1.

    1. the screenshot of bam.file is in the attachment using samtools tview.
      2.1 g.vcf with HaplotypeCaller of GATK3.3.0

    1 245916841 . C . . END=245916841 GT:DP:GQ:MIN_DP:PL 0/0:33:90:33:0,90,1350
    1 245916842 . T . . END=245916842 GT:DP:GQ:MIN_DP:PL 0/0:33:87:33:0,87,1305
    1 245916843 . ATGGGTGAGCTGGGGCAGACAGTCCTCAACAATGTACGGAGAGGGCCCTGTGTGGCCCTGGGCACAGCCCTTCAGCAC A, 0 . BaseQRankSum=-0.098;ClippingRankSum=0.293;DP=42;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=0.195;ReadPosRankSum=-2.122 GT:AD:DP:GQ:PL:SB 0/0:39,3,0:42:99:0,110,15398,117,15402,15410:20,19,0,0
    1 245916921 . T . . END=245916924 GT:DP:GQ:MIN_DP:PL 0/0:26:60:26:0,60,900
    1 245916925 . T . . END=245916925 GT:DP:GQ:MIN_DP:PL 0/0:28:59:28:0,59,643

    2.2 vcf with HaplotypeCaller of GATK3.3.0. Using many samples together to calling

    CHROM POS ID REF ALT QUAL FORMAT HG00108
    1 245916873 rs1934566 A G 4006.41 GT:AD:DP:GQ:PL 0/0:39,0:42:99:0,117,15410

    1. the result with UnifiedGenotyper of GATK2.5

    CHROM POS ID REF ALT QUAL FORMAT HG00108
    1 245916873 rs1934566 A G 6611.42 GT:AD:DP:GQ:PL 0/1:15,13:29:99:240,0,265

    Now, you can see the difference in the same position between GATK2.5.0 and GATK3.2.2.

    Thank you for your help.

    Best,
    Qiang

  • weiqiangweiqiang Member

    @Sheila
    Yes, variants in some samples can be called, while the same variants in other samples cannot be called.

    I simulated some fastq files based on a vcf file with the known variants. So I know the true variants in these positions. When I use the GATK3.3.0 to call these variants, I find that some true variants are not calling. Then I check the process and find g.vcf file is wrong in the positions like what Juber said.

    Juber said "For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf:"

    But when I use the UnifiedGenotyper of GATK2.5 to call these variants, these variants can be found. So I think there may be a big bug in getting the g.vcf file using HaplotypeCaller of GATK3.3.0 or GATK3.2.0.

    My datasets are 30x coverage,
    For example, the position of one variant is 245916873 of chromosome1.

    the screenshot of bam.file is in the attachment using samtools tview.
    2.1 g.vcf with HaplotypeCaller of GATK3.3.0
    1 245916841 . C . . END=245916841 GT:DP:GQ:MIN_DP:PL 0/0:33:90:33:0,90,1350
    1 245916842 . T . . END=245916842 GT:DP:GQ:MIN_DP:PL 0/0:33:87:33:0,87,1305
    1 245916843 . ATGGGTGAGCTGGGGCAGACAGTCCTCAACAATGTACGGAGAGGGCCCTGTGTGGCCCTGGGCACAGCCCTTCAGCAC A, 0 . BaseQRankSum=-0.098;ClippingRankSum=0.293;DP=42;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=0.195;ReadPosRankSum=-2.122 GT:AD:DP:GQ:PL:SB 0/0:39,3,0:42:99:0,110,15398,117,15402,15410:20,19,0,0
    1 245916921 . T . . END=245916924 GT:DP:GQ:MIN_DP:PL 0/0:26:60:26:0,60,900
    1 245916925 . T . . END=245916925 GT:DP:GQ:MIN_DP:PL 0/0:28:59:28:0,59,643

    2.2 vcf with HaplotypeCaller of GATK3.3.0. Using many samples together to calling

    CHROM POS ID REF ALT QUAL FORMAT HG00108
    1 245916873 rs1934566 A G 4006.41 GT:AD:DP:GQ:PL 0/0:39,0:42:99:0,117,15410

    the result with UnifiedGenotyper of GATK2.5
    CHROM POS ID REF ALT QUAL FORMAT HG00108
    1 245916873 rs1934566 A G 6611.42 GT:AD:DP:GQ:PL 0/1:15,13:29:99:240,0,265

    Now, you can see the difference in the same position between GATK2.5.0 and GATK3.2.2.

    Thank you for your help.

    Best,
    Qiang

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @weiqiang
    Hi Qiang,

    Haplotype Caller does a local reassembly step that can change the read positions. Please read more about the reassembly step here: https://www.broadinstitute.org/gatk/guide/article?id=4146 To see how the reads look after reassembly, you can use the bamout argument. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--bamOutput

    Please post the IGV screenshot of the bamout file from the sample above. Please also post a comparison bamout file from a sample in which the variant does get called.

    Thanks,
    Sheila

Sign In or Register to comment.