Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Variant called by HC --emitRefConfidence GVCF but not GenotypeGVCFs

Hi,

I first use HaplotypeCaller to call variants with --emitRefConfidence GVCF, and then the tool GenotypeGVCFs on only the sample. In other words, I did not apply GenotypeGVCFs on cohort but on the sample itself. There was a record called in the first step but was not output in the second step. The particular record is showed below:

chr1 78435701 rs202224025 TA T,TAA, 23.75 . BaseQRankSum=1.312;DB;DP=62;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=1.101;ReadPosRankSum=0.773 GT:AD:GQ:PL:SB 0/2:31,7,8,0:36:61,36,882,0,568,734,154,759,709,863:1,30,0,8

The command lines used are below:

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta -I /input/chr1.bam -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gvcf.vcf -L chr1

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta --variant /output/chr1.gvcf.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gatkHC.vcf --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0 -L /BED/chr1.bed

I thought that this variant should be emitted into the final vcf, since the qual is 23.75 which is greater than 10.0 (set by -stand_emit_conf). Do I misunderstand something here?

Thank you!

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @EmmaDan
    Hi,

    It looks like the site could either be 0/1 or 0/2. Note there are 7 T and 8 TAA in the AD, so the caller is not completely confident which genotype it is. Can you post screenshots of the bam file and bamout file at the site?

    Thanks,
    Sheila

  • EmmaDanEmmaDan ChinaMember

    @Shelia,

    Hi,

    Thank you very much for the quick response! Below is the link for the snapshot of the original bam file (named as OriginalBAM.bam) and bam file derived from -bamout (named as bamoutBAM.bam). Hope this can help!
    Thank you!

    https://onedrive.live.com/?cid=3b39a38fe1bf8448&id=3b39a38fe1bf8448!105&ithint=folder,png&authkey=!AEue4wEqwGZBWR0

    Emma

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited June 2015

    @EmmaDan
    Hi Emma,

    Can you please try running Haplotype Caller in normal mode on the bam file? Please simply run Haplotype Caller with no extra annotations added to the command and without -ERC GVCF. Can you confirm the deletion reads and the insertion reads have good quality scores?

    Thanks,
    Sheila

  • EmmaDanEmmaDan ChinaMember

    @Sheila,

    Hi Sheila,

    I produced the bam file by running HC in normal mode and without any extra annotations following your instruction. But the positition (chr1 78435701) was not called this time and therefore no reads were output to cover it. Below is the link for the snapshot. This position was neither called by GenotypeGVCFs when running on the gvcf file produced by HC with -ERC GVCF, which was actually called in this step as I have described in very beginning post.

    https://onedrive.live.com/?cid=3b39a38fe1bf8448&id=3B39A38FE1BF8448!109&v=3&ithint=photo,png&authkey=!APZqJoP-oS5HOhU

    In fact, I have got five samples called at this postition in the gvcf file produced by HC with -ERC GVCF. Here are all the records:

    Sample HSF:
    chr1 78435701 . T TA, 0.01 . BaseQRankSum=-0.028;DP=47;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.23;MQ0=0;MQRankSum=0.484;ReadPosRankSum=-0.427 GT:AD:GQ:PL:SB 0/1:28,4,0:12:12,0,563,94,575,669:2,26,0,4

    Sample HYF:
    chr1 78435701 rs202224025 TA T,TAA, 114.73 . BaseQRankSum=1.652;DB;DP=41;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.26;MQ0=0;MQRankSum=0.057;ReadPosRankSum=-0.019 GT:AD:GQ:PL:SB 0/1:18,10,4,0:99:152,0,428,138,315,605,206,418,548,616:0,18,1,9

    Sample HYH:
    chr1 78435701 rs202224025 TA T,TAA, 0.16 . BaseQRankSum=1.379;DB;DP=69;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=0.448;ReadPosRankSum=-0.483 GT:AD:GQ:PL:SB 0/1:30,6,2,0:23:23,0,780,82,648,808,113,699,769,799:2,28,0,6

    Sample WSP:
    chr1 78435701 rs202224025 TA T,TAA, 23.75 . BaseQRankSum=1.312;DB;DP=62;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=1.101;ReadPosRankSum=0.773 GT:AD:GQ:PL:SB 0/2:31,7,8,0:36:61,36,882,0,568,734,154,759,709,863:1,30,0,8

    Sample ZJY:
    chr1 78435701 rs202224025 TA T,TAA, 19.78 . BaseQRankSum=-0.308;DB;DP=55;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.20;MQ0=0;MQRankSum=-0.730;ReadPosRankSum=-0.721 GT:AD:GQ:PL:SB 0/1:25,7,5,0:34:57,0,635,34,452,640,132,580,602,700:0,25,0,7

    But only Sample HYF still hold this recard after running tool GenotypeGVCFs on each individual gvcf (apply GenotypeGVCFs NOT on cohort but on the sample itself). While running GenotypeGVCFs on cohort, the record was regained for three other samples except Sample HSF.

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HSF HYF HYH WSP ZJY
    chr1 78435701 rs202224025 TA T 270.89 . BaseQRankSum=1.31;DB;DP=274;FS=0.000;MLEAC=3;MLEAF=0.300;MQ=60.20;MQ0=0;MQRankSum=0.448;QD=2.02;ReadPosRankSum=-4.270e-01;SOR=0.515 GT:AD:GQ:PL 0/0:28,0:82:0,82,657 0/1:18,10:99:152,0,428 0/1:30,6:23:23,0,780 0/1:31,7:25:25,0,846 0/1:25,7:57:57,0,635

    Hope all these information help!
    Thank you very much!

    Emma

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @EmmaDan
    Hi Emma,

    Yes, this is totally expected behavior. The output of GenotypeGVCFs and Haplotype Caller in normal mode should be the same. If you notice, there is not much evidence for the variant in each sample individually, but when called together, the evidence becomes greater. That is the whole point of calling variants on cohorts, so you can rescue variants that may not have enough evidence in one individual sample :smile:

    -Sheila

  • EmmaDanEmmaDan ChinaMember

    @Sheila

    Hi Sheila,

    Thank you very much for the answer! I now better understand the point of calling variants on cohorts. But unfortunately this variant is a false positive verified by sanger sequencing. Since it was not called in each sample individually, but called in cohort mode, I wonder whether it would be better to skip the joint calling step if individual sample have sufficient depth (say an average of 30x?) for variant calling.

    Emma

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @EmmaDan
    Hi Emma,

    Well, that is unfortunate! Can you submit a bug report, so we can take a look at why this is getting called? Hopefully there is something we can do to filter these types of false positives out. Instructions for submitting a bug report are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hang on, I don't think there's any bug here. If the values in the vcf record are correct, then the program is doing the right thing. If you're sure it's a false positive, then that means you have some sort of sequencing artifact at this site, but that's not the caller's problem, it's a filtering problem. The caller is designed to be super sensitive, and the raw calls are bound to contain false positives. That's why you have to apply filters afterward, preferably VQSR. Did you do any filtering? Filtering on QUAL is not sufficient.

Sign In or Register to comment.