Unusual calls after using HaplotypeCaller - filtered with VQSR and refinement workflow

astrandastrand New YorkMember

Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian error just in the biallelic SNPs in my final VCF file (using PLINK) and found unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

HaplotypeCaller (generate gvcf files for each individual for each chromosome

java -Xmx1g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_${ROOT}.bam -o ${outpath}raw_${ROOT}.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

GenotypeGVCFs (generate vcf files for each chr)

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr}

CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance,
Alva

Best Answers

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @astrand,

    The first part of your question (why the high percentages of probable mendelian error and triallelic SNPs) is beyond the scope of support we can provide. We just don't have the resources to address something like that in detail.

    Regarding your second question, I don't see that you did any filtering on coverage directly -- what you applied was the GQ<20 filter, which it rightfully passed since its GQ is 22. The real question is, why does it have a GQ of 22 when it seems to have zero coverage. Have you looked at the gvcf for that sample, and the original data? That's the first thing to do, to figure out if there is in fact coverage but it's somehow not being reported correctly.

  • astrandastrand New YorkMember

    Hi Geraldine,

    Thanks for your reply. I looked at the gvcf file and see the following information for the position for the 1st individual on chr1:

    1 20199272 . T TCTTC,TCTTCCTTCCTTCCTTC,TCTTTCTTTCTTTCTTTCTTTCTTC,TCTTTCTTTCTTTCTTTCTTTCTTCCTTCCTTCCTTC,<NON_REF> 372.73 . DP=20;MLEAC=0,0,0,0,2;MLEAF=0.00,0.00,0.00,0.00,1.00;MQ=52.78;MQ0=0 GT:AD:DP:GQ:PL:SB 5/5:0,0,0,0,0,0:0:22:410,207,355,232,207,437,212,203,57,355,237,55,286,209,437,32,22,31,26,34,0:0,0,0,0

    The information is exactly the same here as in the combined vcf file. Both the AD and DP fields have values of zero, and the genotype quality is 22 here as well.

    I used samtools mpileup to look at the depth of coverage at that site for that individual at that chr. I got the following information:

    1 20199272 T 29 ,$,$,,,,..,,....,,,,C,cccCcccCC @@BCAE!<>='D<;=B:B!-1*6$68;57

    It seems that the reference allele is supported, as well as the alternative C allele. The depth of coverage is 29.

    My question is: how does HaplotypeCaller calculate genotype quality? How come the GQ value is 22? samtools mpileup only tells me information about the base quality, not the mapping quality. How do I go about diagnosing this problem further.

    Thanks very much in advance,
    Alva

  • astrandastrand New YorkMember

    Thanks Steve, that is somewhat reassuring. Right, PL denotes the phred-scaled likelihood of that genotype existing. This is what makes the most sense to me right now. Maybe the GATK folks will have some additional insights into this; otherwise, I'm just going to go ahead and mask those genotypes as missing.

    Alva

  • astrandastrand New YorkMember

    Hi Geraldine,

    I submitted a bug report; the zipped file is called ZeroCoverageCalls.zip. Let me know if there are any issues.

    Thanks,
    Alva

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    926
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @astrand
    Hi Alva,

    You case is indeed strange. I have just submitted a bug report and will let you know when it is fixed.

    -Sheila

  • astrandastrand New YorkMember
  • astrandastrand New YorkMember

    I was just revisiting the Best Practices workflow and saw that the aligning around indels and base recalibration steps need to be performed on the whole genome. I did not realize that and so performed the steps on individual chromosomes for each individual in my sample. Would this potentially cause problems in the long run? How can I check whether this caused problems with my files? Thanks.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @astrand
    Hi Alva,

    Running Indel Realigner and Base Recalibration on each chromosome is fine. I don't think that is what is causing this bug.

    -Sheila

  • astrandastrand New YorkMember

    OK, that's good, thanks for your reply @Sheila.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @astrand
    Hi Alva,

    With only one sample, the final vcf (from both Haplotype Caller and GenotypeGVCFs) does not contain a record for that site. The reason the GVCF contains all 0 ADs is because the reads are all uninformative. A read is considered uninformative if the likelihood between the read's most likely allele and second most likely allele is below 0.2. I realize this is confusing, so I am going to prepare a document to explain this more. Basically, the most likely allele should be ~60% more likely than the next most likely allele in a read. In your example, there are 20 reads, but the likelihoods between each read's most likely allele and the second most likely allele are ~0.02 which is less than the threshold of 0.2

    If you can submit a few more sample bams that will make the site appear in the final vcf, I can submit an extended bug report.

    I hope this helps.

    -Sheila

Sign In or Register to comment.