Missing sites in vcf file

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036
16050612
16050822
16050933
16051556
16051968
16051994
16052080
16052167
16052239
16052250
16052357
etc.

whereas the initial set of sites in my final vcf file is:

16050822
16050933
16051347
16051497
16051556
16051968
16051994
16052080
16052167
16052169
16052239
16052357
etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs:
WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used
How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs:
java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22
(only running this for chr22)

Thank you in advance,
Alva

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @astrand‌

    Hi Alva,

    Can you please try this with the latest nightly build? This seems like a bug that has been recently fixed. You can get the nightly here: https://www.broadinstitute.org/gatk/nightly

    You can change the maximum number of alternate alleles with --max_alternate_alleles argument https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--max_alternate_alleles

    -Sheila

  • astrandastrand New YorkMember

    OK, thanks. I will try the nightly build and report whether the problem was fixed or not.

    Alva

  • astrandastrand New YorkMember

    Hi Sheila,

    I get the following error message when running the latest nightly build (12/9):

    Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

    My command was the following:
    java -Xmx4g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf --max_alternate_alleles 9 -o final_2_Hutt.vcf -L 22

    The variants are exactly the same as last time.

    Alva

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @astrand‌

    Hi Alva,

    I think GATK is not able to recognize the format of one or more of your vcf files. You can try validating them with VCFTools Validator http://vcftools.sourceforge.net/perl_module.html#vcf-validator and GATK Validate Variants https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php

    -Sheila

  • astrandastrand New YorkMember

    Hi Sheila,

    I tried using one of the latest nightly builds and re-run GenotypeGVCFs using that nightly build. I was able to create a final vcf file, but the result was exactly the same: missing sites. What should I do?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @astrand‌

    Hi Alva,

    Can you post the command you used to generate the GVCFs? Also, please post some records that show the missing positions in the vcf that are present in the GVCF.

    Thanks,
    Sheila

  • astrandastrand New YorkMember

    The command I used to generate the GVCFs is the following:
    java -Xmx2g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal-1.bam -o raw-1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

    Here are the heads of the raw vcf (gvcf) files and you can see that position 16050036 is different:

    raw-1.vcf:

    22 1 . N <NON_REF> . . END=16050022 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 22 16050023 . C <NON_REF> . . END=16050023 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,37 22 16050024 . A <NON_REF> . . END=16050026 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73 22 16050027 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,110 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 22 16050037 . G <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,109

    raw-2.vcf:

    22 1 . N <NON_REF> . . END=16050025 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 22 16050026 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,27 22 16050036 . A <NON_REF> . . END=16050036 GT:DP:GQ:MIN_DP:PL 0/0:1:0:1:0,0,0 22 16050037 . G <NON_REF> . . END=16050043 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,33

    raw-3.vcf:

    22 1 . N <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 22 16050038 . A <NON_REF> . . END=16050039 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,36

    This is the head of the final vcf file and it does not include this position:

    22 16050822 . G A 536.16 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.36;ClippingRankSum=-3.030e-01;DP=71;FS=0.000;GQ_MEAN=136.33;GQ_STDDEV=81.93;MLEAC=2;MLEAF=0.333;MQ=27.88;MQ0=0;MQRankSum=0.910;NCC=0;QD=15.77;ReadPosRankSum=0.530;SOR=0.723 GT:AD:DP:GQ:PL 0/0:37,0:37:99:0,101,1530 0/1:4,12:16:78:327,0,78 0/1:9,9:18:99:241,0,230 22 16050933 . G A 608.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=0.733;ClippingRankSum=-9.280e-01;DP=116;FS=0.000;GQ_MEAN=274.00;GQ_STDDEV=316.19;MLEAC=1;MLEAF=0.167;MQ=41.05;MQ0=0;MQRankSum=-2.081e+00;NCC=0;QD=12.16;ReadPosRankSum=0.987;SOR=0.764 GT:AD:DP:GQ:PL 0/1:28,22:50:99:639,0,841 0/0:36,0:36:99:0,99,1485 0/0:30,0:30:84:0,84,1260
    etc.

    I diagnosed the problem further. I re-checked my bam files to see if they corresponded to the raw vcf files, and they do not at the bold positions. For position 16050036 for instance, the 3 bam files have the base "C" at that position:

    bam file 1:

    HWI-ST1329:286:C2DT8ACXX:1:1104:10037:71314 163 22 16050029 0 100M = 16050288 359 AGCTGTGCGACCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGGGCCTAAGTGCCTCCTCTCGGGACTGGTATGGGGACGGTCATGCAATCTG [email protected][email protected]BDFGDBAAFEEDBE?EBDFDFEECABCD MC:Z:100M BD:Z:JJLLLIJLJJKKLJKKKKLJKKKKIJJKKLKKLJKKLLKKKIKJKLLLKJKJKKLKJKKJLKLKKLKJKJJJKKKLKKKIKKJJKKJJKKKKLMMKKKJL MD:Z:7A92 PG:Z:MarkDuplicates.2A RG:Z:Hutt-1 BI:Z:MMNNNLMNMMNMNMNNNMNLNNNMMNMMNNMNNMMMNMNMMMNMNNNNNMNLNMNNMNNMNMNMNNMMMMMLNNNNNNNMNNLLNNMMNNMNNNNLNMMN NM:i:1 MQ:i:10 AS:i:95 XS:i:100

    bam file 2:

    HWI-ST1329:317:C2UWWACXX:2:2213:13110:2340 99 22 16050026 21 100M = 16050372 446 AAGAGCTGTGCGACCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGGGCCTAAGTGCCTCCTCTCGGGACTGGTATGGGGACGGTCATGCAAT @@@AAD>[email protected];AACCB>?ACCBBACBAB>BAC?<@@CB6=>=7>BBC?CAADD<@=3;>>[email protected][email protected]>=9=>B;[email protected][email protected]> MC:Z:100M BD:Z:JJJJKKKKJJKJJJKKJKKKKKJJKKJJKJJJKJJKJJJKKKJJJKJKKKKKJKJKKKJJJKJKKKJJKJJJJJJKJKKKKJIKKJJKJJJKKJKKKKJJ MD:Z:10A89 PG:Z:MarkDuplicates.1W RG:Z:Hutt-2.I BI:Z:MMMMNNNNMMMMMMMNMMMMMNMMNMMMMMMMNMMNMMMNMMMMMNMMMNNNMMMMMNMMMNMMMNMMNMMMMMMMMMNMMMMMMMMMMMMMMMNMMNMM NM:i:1 MQ:i:23 AS:i:95 XS:i:100

    bam file 3:

    HWI-ST1329:287:C2JDBACXX:3:1115:9283:81316 163 22 16050035 15 100M = 16050306 371 GCGACCTTGGCCAAGTCACTTCCTCCTTCAGGAACATTGCAGTGGGCCTAAGTGCCTCCTCTCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAAC AB:ACCDBE[email protected][email protected] MC:Z:100M BD:Z:JJJJLKKJKKKKKJJKKKIJIJJJKKJIJJKKKKKIKJKKLLKJKJKKKKJJKJKKKKKKKJKJJJKKKLKKKIKKJJKKJJKKKKKKLJJKJLKLKIJK MD:Z:1A98 PG:Z:MarkDuplicates.1K RG:Z:Hutt-3.1 BI:Z:MMMMNMNMNNNNNMNNNMMNMMMNMMNMMMNNNMNMNMNNNNNMNLNNNNMNNMNNNMMNMMMMMLNNNNNNMMNNLLNNMMNNMNNNNMMNMNNNNMMN NM:i:1 MQ:i:15 AS:i:98 XS:i:100

    Speaking of bam files, I am actually confused as to why I only see one haplotype for the range of positions specified. The first individual (for bam file 1) is supposed to be heterozygous at positions 16050036 (A/C), but I only see the base "C" in the file. Am I reading the bam files incorrectly?

    Thanks very much for your help,
    Alva

  • astrandastrand New YorkMember

    Hi Sheila,

    I understand that there are two filters that are being used, --standard_min_confidence_threshold_for_calling and --standard_min_confidence_threshold_for_emitting, and that their default threshold value is both 30. However, in my vcf files, at QUAL (7th field), I only see one value, not 2. Which one is being displayed?

    The filtering process partly explains the issue I had, but not completely. I extracted all the positions that had a quality value below 30 in each of my gvcf files (3 files in total), and ran a test to see if these positions were included in the final vcf file (I expected them not to be included). 1000+ in each file were. One example: position 16060479:

    In gvcf file 1, it appears like this with quality value 0 and 1 alternate + 1 deletion:

    22 16060479 . TTTTTTCTTTTTC T,TTTTTTC,<NON_REF> 0 . DP=60;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=53.55;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:22,0,0,0:22:72:0,125,1719,78,1057,1068,72,1050,998,991:15,7,0,0

    This seems strange to me because there are 0 alternate reads or reads showing deletions (22,0,0,0), yet we have T,TTTTTTC at the ALT field.

    In gvcf file 2, appears like this:

    22 16060479 . TTTTTTCTTTTTC T,<NON_REF> 180.73 . BaseQRankSum=0.047;ClippingRankSum=-0.047;DP=37;MLEAC=1,0;MLEAF=0.500,0.00;MQ=50.02;MQ0=0;MQRankSum=-0.796;ReadPosRankSum=0.140 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:12,6,0:18:99:0|1:16060479_TTTTTTCTTTTTC_T:218,0,469,254,489,743:4,8,4,2

    In gvcf file 3, it appears like this again with quality value 0 and 1 alternate + 1 deletion:

    22 16060479 . TTTTTTCTTTTTC T,TTTTTTC,<NON_REF> 0 . DP=37;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=49.38;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:7,0,0,0:7:22:0,40,534,47,344,620,22,316,325,297:4,3,0,0

    But it appears like this in the final vcf file:

    22 16060479 . TTTTTTCTTTTTC T 187.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=0.047;ClippingRankSum=-4.700e-02;DP=134;FS=0.000;GQ_MEAN=127.67;GQ_STDDEV=89.03;MLEAC=1;MLEAF=0.167;MQ=50.02;MQ0=0;MQRankSum=-7.960e-01;NCC=0;QD=2.60;ReadPosRankSum=0.140;SOR=1.008 GT:AD:DP:GQ:PGT:PID:PL 0/0:22,0:22:99:.:.:0,125,1719 0/1:12,6:18:99:0|1:16060479_TTTTTTCTTTTTC_T:218,0,469 0/0:7,0:7:40:.:.:0,40,534

    I would have thought that filtering would be more strict, and that if at a particular position, one individual does not pass the quality filter, the position is removed entirely... How does GenotypeGVCFs deal with situations where there are 2 alternate alleles or 1 alternate + 1 indel like in the case above? Are there other filters used besides the 2 I mentioned at the beginning?

    Thanks,
    Alva

  • astrandastrand New YorkMember

    From 1000 genomes wiki, the 6th field displays the quality score:

    "QUAL phred-scaled quality score for the assertion made in ALT. i.e. give -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. (Numeric)"

    Seems very much linked to -stand_call_conf, which is described as "the minimum phred-scaled confidence threshold at which variants should be called". I think that you can modify the threshold using the option -stand_call_conf, but the QUAL field is still very helpful in terms of telling you that your position was tossed out because it didn't pass the filter.

  • ... which is exactly why you can specify different thresholds for emission and calling. If emit_conf <= QUAL < call_conf, it's got a LowQual filter

  • astrandastrand New YorkMember

    OK, that makes sense. That was not my issue in the first place, but thanks for providing further explanations (both of you).

  • astrandastrand New YorkMember

    My issue regarding how GenotypeGVCFs handles information regarding different individuals at a specific position stills stands, however (see example that I provided above). In particular, I am confused as to why GenotypeGVCFs does not immediately throw out sites where at least one individual failed the emission filter. None of my sites are labeled as LowQual, if that helps.

  • Why would you want it to? If the presence of low coverage or ambiguous data in a single sample disqualified that site from being called in the entire cohort, you would never be able to do low coverage whole genome experiments. Technical failures would be completely disastrous in the analysis, for that matter so would individuals with large deletions. I suspect that even high quality exomes would start to see significant loss of data at high sample numbers - sometimes you get funny artifacts in sequencing.

  • astrandastrand New YorkMember

    That makes sense. I also checked whether there were any sites that passed the calling confidence threshold that actually were removed from the final vcf, and some were, but very few. Could you explain to me why the following site was removed (info provided below), but not the site at position 16060479, which I provided info of above? I set the --max_alternate_alleles option to be 9.

    In gvcf 1, passes calling confidence threshold:

    22 29416237 . CTT C,CT,<NON_REF> 55.73 . BaseQRankSum=-0.865;ClippingRankSum=0.627;DP=25;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=53.16;MQ0=0;MQRankSum=-0.225;ReadPosRankSum=-0.212 GT:AD:DP:GQ:PL:SB 0/1:6,5,3,0:14:19:93,0,176,19,56,74,53,86,82,112:5,1,5,0

    In gvcf 2, does not pass:

    22 29416237 . CT C,<NON_REF> 0 . BaseQRankSum=-2.620;ClippingRankSum=0.930;DP=22;MLEAC=0,0;MLEAF=0.00,0.00;MQ=54.27;MQ0=0;MQRankSum=0.423;ReadPosRankSum=-0.423 GT:AD:DP:GQ:PL:SB 0/0:10,3,0:13:0:0,0,136,29,144,173:10,0,0,0

    In gvcf 3, does not pass:

    22 29416237 . CT C,<NON_REF> 6.01 . BaseQRankSum=-1.380;ClippingRankSum=0.720;DP=19;MLEAC=1,0;MLEAF=0.500,0.00;MQ=56.36;MQ0=0;MQRankSum=0.720;ReadPosRankSum=-0.731 GT:AD:DP:GQ:PL:SB 0/1:2,4,0:6:22:42,0,22,48,33,81:1,1,4,0

    Another example of a site that was removed but don't understand why:

    gvcf1:
    22 41180788 . C CAAAAAA,CAAAAAAA,<NON_REF> 46.73 . BaseQRankSum=-0.731;ClippingRankSum=0.731;DP=12;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=58.59;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=-0.731 GT:AD:DP:GQ:PL:SB 0/2:4,0,1,0:5:17:84,17,174,0,167,182,17,174,167,174:4,0,1,0

    gvcf2:
    22 41180788 . C CAAAAA,<NON_REF> 0.31 . BaseQRankSum=0.736;ClippingRankSum=0.736;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:5,1,0:6:26:26,0,188,41,194,235:5,0,1,0

    gvcf3:
    22 41180788 . C CAAAAA,<NON_REF> 0.05 . DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=58.71;MQ0=0 GT:AD:DP:GQ:PL:SB 0/1:2,0,0:2:18:18,0,83,24,86,110:2,0,0,0

    Thanks for all the help!
    Alva

  • I think you have a subtle misunderstanding of the gVCF workflow (I may too, I'm sure Geraldine will correct me if I get this wrong). The individual gVCFs are not necessarily mature, correct genotypes. The function of these files is really to provide a measure of confidence for each site in the genome, regardless of whether it looks like it might be a reference site or not. GenotypeGVCFs doesn't just combine these files, it actually calculates the genotypes based on the confidences, considered both individually and across the cohort.

    I know that in the final VCFs, the QUAL field is not considered definitive - notice that it's not a covariate in VQSR, for instance. I don't know how important it is in gVCFs. It could be that it's essentially meaningless. So taking all of that together, I don't really think studying individual sites in gVCFs and the final VCF is terribly instructive. If you have sites that you think are miscalled in the final gVCF, it's usually more helpful to look directly at the bams, possibly augmented by the -bamout debug output from HC

Sign In or Register to comment.