Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

EMIT_ALL_SITES in HaplotypeCaller

eflynn90eflynn90 Washington DCPosts: 48Member

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Answers

  • pdexheimerpdexheimer Posts: 323Member, GSA Collaborator ✭✭✭

    I'm not speaking from a great deal of experience here (this isn't my typical workflow), but it sounds like you want to use GENOTYPE_GIVEN_ALLELES (the -gt_mode and -alleles arguments).

    I can't offer a good explanation for the behavior of EMIT_ALL_SITES, except that it's very difficult to emit all potential indels - so it may just skip them entirely (or only calculate a fixed subset, or something like that)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    I agree that it sounds like you want to use GENOTYPE_GIVEN_ALLELES rather than EMIT_ALL_SITES, if the point is to get calls at the specific sites in a VCF.

    Note that starting with version 2.7, EMIT_ALL_SITES will no longer be available for HaplotypeCaller. Instead, you'll want to use the -ERC argument, which outputs sites called as reference along with a confidence score that they are actually reference.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 48Member

    So, I tried changing around the variables, but I still cannot get HC to output all sites. My original interval file was a vcf with 83 positions. With the following arguments I get different outputs, but none contains all 83 positions.

    -L sites.vcf -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_notpadded.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_paddedby55bases.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 79 positions (not all overlap with the original vcf though)

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0: 26 positions

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 59 positions

    -L sites_notpadded.bed -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 66 positions

    -L sites_paddedby55bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    -L sites_paddedby110bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    I briefly looked at the bam files and there are many reads which overlap the uncalled positions.

    I am using GATK v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31

  • eflynn90eflynn90 Washington DCPosts: 48Member

    Ah, okay, I see one problem. For the last two runs (with padded bed files, gt given alleles, and emit all sites), the three SNPs which were not called were labeld "LowQual" in the -alleles vcf. So that makes sense if HC is set up not to genotype places that are labeled as "LowQual" in the FILTER column.

  • eflynn90eflynn90 Washington DCPosts: 48Member

    However, I still don't understand why all sites were not output when the intervals file was a vcf/the unpadded bed file. It was not a question of different length ref alleles, because most of the sites that weren't output were SNPs.

  • eflynn90eflynn90 Washington DCPosts: 48Member

    Either way I'll be able to run HC to genotype at those locations now, so thanks for your help!

  • eflynn90eflynn90 Washington DCPosts: 48Member

    I have one more question about using GENOTYPE_GIVEN_ALLELES. In the documentation, you warn about increasing the max alleles argument because it will drastically slow down HC. Is this true even then you are using the given alleles argument?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    It depends how you are using GGA mode. If you are passing in the vcf of alleles as intervals list as well (using -L) then the max alleles will be irrelevant because the caller will only genotype those sites with the given alleles. If you don't pass in the sites as intervals, the caller will also attempt to call all other sites in the genome normally (in addition to doing the given sites in GGA mode), and for that the max alleles are still relevant.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 48Member

    So, if GGA mode is meant to function with the same vcf as the intervals and alleles file, why am I not getting all sites output when I use the vcf as the intervals file?

    Out of 80 valid positions in the vcf file, I got 59 positions with the vcf as the intervals file, and I got 80 positions with the padded bed file as the intervals file. (All other settings were the same and the vcf was the alleles file in both runs.) The 80 positions with the padded bed file were all variants that were in the original vcf file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hmm. Can you post an example of what variants were left out when using the vcf file as intevals?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 48Member
    edited August 2013

    For example, this was in the alleles/intervals file:

    1       14773   .       C       T       720.92  .       .  
    1 14930 rs75454623 A G 34793.34 . .
    1 14933 rs199856693 G A 685.76 PASS .
    1 14948 rs201855936 G A 590.15 . .
    1 14976 rs71252251 G A 663.76 . .
    1 15029 rs201045431 G A 269.33 . .
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 48Member
    edited August 2013

    And this was in the output vcf:

    1       14773   .       C       T       631.54  .       AC=3;AF=0.250;AN=12;BaseQRankSum=-0.828;ClippingRankSum=-0.028;DP=1449;FS=12.504;MLEAC=3;MLEAF=0.250;MQ=39.59;MQ0=0;MQRankSum=4.766;QD=0.85;ReadPosRankSum=-0.150       GT:AD:DP:GQ:PL  0/1:204,43:247:99:174,0,4860    0/0:270,0:270:99:0,808,7629     0/0:133,0:133:99:0,399,3600......

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=946;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.30;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:213,0:213:99:0,633,6199 0/0:82,0:82:99:0,241,2293 0/0:167,0:167:99:0,494,4487 0/0:135,0:135:99:0,395,3647 0/0:218,0:218:99:0,654,6411.........
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 48Member

    Sorry for the spacing. But basically of those six alleles, only two were output.

  • eflynn90eflynn90 Washington DCPosts: 48Member

    This is the output in using the padded bed file as the intervals file:

    1 14773 . C T 644.54 . AC=3;AF=0.250;AN=12;BaseQRankSum=-1.005;ClippingRankSum=0.337;DP=1419;FS=14.707;MLEAC=3;MLEAF=0.250;MQ=39.50;MQ0=0;MQRankSum=3.588;QD=0.90;ReadPosRankSum=-0.993 GT:AD:DP:GQ:PL 0/1:189,45:234:99:313,0,4469 0/0:270,0:270:99:0,808,7613 0/0:136,0:136:99:0,408,3695 0/1:

    1 14930 rs75454623 A G 30412.34 . AC=6;AF=0.500;AN=12;BaseQRankSum=-1.491;ClippingRankSum=-2.096;DB;DP=1989;FS=37.780;MLEAC=6;MLEAF=0.500;MQ=42.43;MQ0=0;MQRankSum=-4.878;QD=15.29;ReadPosRankSum=2.487 GT:AD:DP:GQ:PL 0/1:241,59:300:99:1287,0,5974 0/1:167,215:382:99:8066,0,3769 0/1:154,55:209:99:13

    1 14933 rs199856693 G A 1146.76 . AC=2;AF=0.167;AN=12;BaseQRankSum=-4.169;ClippingRankSum=-3.201;DB;DP=1992;FS=14.145;MLEAC=2;MLEAF=0.167;MQ=42.68;MQ0=0;MQRankSum=-6.500;QD=1.42;ReadPosRankSum=0.657 GT:AD:DP:GQ:PL 0/0:299,0:299:99:0,888,8528 0/0:382,0:382:99:0,1145,11499 0/0:208,0:208:99:0,620,6005

    1 14948 rs201855936 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=0.615;ClippingRankSum=1.003;DB;DP=1747;FS=3.429;MLEAC=0;MLEAF=0.00;MQ=40.15;MQ0=0;MQRankSum=0.321;ReadPosRankSum=-1.497 GT:AD:DP:GQ:PL 0/0:262,0:262:99:0,833,10392 0/0:332,0:332:99:0,1244,22062 0/0:182,0:182:99:0,595,8088 0/0:333,0:33

    1 14976 rs71252251 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=-1.070;ClippingRankSum=-4.000;DB;DP=1362;FS=20.859;MLEAC=0;MLEAF=0.00;MQ=37.27;MQ0=0;MQRankSum=-4.166;ReadPosRankSum=3.406 GT:AD:DP:GQ:PL 0/0:212,1:213:99:0,716,7259 0/0:265,0:265:99:0,1234,13930 0/0:136,0:136:99:0,501,5256 0/0:

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    I'm not sure what's happening here. I'll ask @ebanks to have a look.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi, I'm having a similar problem running HaplotypeCaller v2.7-2-g6bda569 in GGA mode. The input VCF (-L and -alleles argument) file contains 2,186 variants, but the output VCF only contains 1,874. When I genotype with UnifiedGenotyper v2.7-2-g6bda569, the output file contains all 2,186 variants in the input file. I've listed the command lines used below.

    More details:

    * The output VCF does contain some variants where both of the input BAM files are genotyped as reference 0/0, and where both are called as missing ./., so these categories aren't being systematically excluded.

    * At least some of the variants not being output by HaplotypeCaller have coverage

    * There are no variants marked as LowQual in the input VCF.

    * The input file is single-sample and all variants in it have genotype 0/1.

    Any ideas on how to get HaplotypeCaller to genotype all 2,186 variants would be much appreciated.

    Thanks,

    Katherine

    HaplotypeCaller

    java -Xmx4g -jar GenomeAnalysisTK.jar \

    -T HaplotypeCaller \

    -R ${reference} \

    -I sample1.bam \

    -I sample2.bam \

    -L $vcfIn \

    -alleles $vcfIn \

    -o output1.vcf \

    -gt_mode GENOTYPE_GIVEN_ALLELES \

    -stand_call_conf 0.0 \

    -stand_emit_conf 0.0

    UnifiedGenotyper

    java -Xmx4g -jar GenomeAnalysisTK.jar \

    -T UnifiedGenotyper \

    -R ${reference} \

    -I sample1.bam \

    -I sample2.bam \

    -L $vcfIn \

    -alleles $vcfIn \

    -o output2.vcf \

    -gt_mode GENOTYPE_GIVEN_ALLELES \

    -out_mode EMIT_ALL_SITES \

    -stand_call_conf 0.0 \

    -stand_emit_conf 0.0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    To find out what's going on here, I suggest you identify a couple of the variants that are disappearing when you run HC in GGA mode, then run again and use the option to generate an output BAM of what the HC reassembly step produced. Have a look at them in IGV and contrast vs. the original assembly and the original vs new calls. If the reassembly is radically changing the picture of what variation is present it might affect how the variants are output.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi Geraldine,

    I looked at the first ten variants in the input VCF, of which only four were genotyped. I extracted these ten variants to a new VCF and then tried to genotype them using --bamOutput out.bam and --bamWriterType CALLED_HAPLOTYPES. Five of the six variants that are not genotyped have no coverage in the HC output BAM - these are all located within ~ 20 bp. The sixth variant that is not genotyped has a few reads of coverage. All have coverage in the original BAM file.

    I tried a simplified experiment where I called variants for a single sample using HaplotypeCaller (restricting analysis to a single chromosome). This yielded 1290 variants (it's a low coverage sample). I then attempted to genotype the same sample at the locations of the variants just detected from that sample. Only 1075 variants were output by the genotyping step. Commands used:

    Variant calling

    java -Xmx4g -jar GenomeAnalysisTK.jar \

    -T HaplotypeCaller \

    -R ${reference} \

    -I NA18507.bam \

    -L chr21 \

    -o NA18507.var.hc.chr21.vcf \

    -stand_call_conf 20 \

    -stand_emit_conf 20

    Genotyping

    java -Xmx4g -jar GenomeAnalysisTK.jar \

    -T HaplotypeCaller \

    -R ${reference} \

    -I NA18507.bam \

    -L NA18507.var.hc.chr21.vcf \

    -alleles NA18507.var.hc.chr21.vcf \

    -o NA18507.geno.hc.chr21.vcf \

    -gt_mode GENOTYPE_GIVEN_ALLELES \

    -stand_call_conf 0.0 \

    -stand_emit_conf 0.0

    I would have expected all variant lines in the VCF file output by variant calling to be present in the VCF file output by genotyping. Am I missing something? Thanks for your thoughts!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Hi @KatherineS,

    The issue here might be that HaplotypeCaller does not do well on single-base intervals. This is something I had not run into before, so my apologies for the lack of documentation, but for running HC in GGA mode, you actually need to pad the intervals with e.g. 150 bp on either side. You can do this automatically with the --interval_padding argument. Please try this and let me know if it solves your problem.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi @Geraldine_VdAuwera,

    That definitely helps! Adding -ip 150 to the genotyping step results in 1290/1290 variant lines being output for the simplified experiment. For my application of interest, it results in 2179/2186 variant lines being output. -ip 200 yields 2185/2186. The remaining variant, rs12012803, isn't genotyped even at -ip 20000.

    Thank you so much for your help - I'm hugely grateful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,821Administrator, GATK Developer admin

    Glad to hear it :)

    I'll document this as a gotcha for running HC in GGA mode.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.