The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# EMIT_ALL_SITES in HaplotypeCaller

Washington DCPosts: 56Member

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Tagged:

• Posts: 330Member, GSA Collaborator ✭✭✭

I'm not speaking from a great deal of experience here (this isn't my typical workflow), but it sounds like you want to use GENOTYPE_GIVEN_ALLELES (the -gt_mode and -alleles arguments).

I can't offer a good explanation for the behavior of EMIT_ALL_SITES, except that it's very difficult to emit all potential indels - so it may just skip them entirely (or only calculate a fixed subset, or something like that)

I agree that it sounds like you want to use GENOTYPE_GIVEN_ALLELES rather than EMIT_ALL_SITES, if the point is to get calls at the specific sites in a VCF.

Note that starting with version 2.7, EMIT_ALL_SITES will no longer be available for HaplotypeCaller. Instead, you'll want to use the -ERC argument, which outputs sites called as reference along with a confidence score that they are actually reference.

Geraldine Van der Auwera, PhD

• Washington DCPosts: 56Member

So, I tried changing around the variables, but I still cannot get HC to output all sites. My original interval file was a vcf with 83 positions. With the following arguments I get different outputs, but none contains all 83 positions.

-L sites.vcf -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

-L sites_notpadded.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

-L sites_paddedby55bases.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 79 positions (not all overlap with the original vcf though)

-L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0: 26 positions

-L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 59 positions

-L sites_notpadded.bed -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 66 positions

-L sites_paddedby55bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

-L sites_paddedby110bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

I briefly looked at the bam files and there are many reads which overlap the uncalled positions.

I am using GATK v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31

• Washington DCPosts: 56Member

Ah, okay, I see one problem. For the last two runs (with padded bed files, gt given alleles, and emit all sites), the three SNPs which were not called were labeld "LowQual" in the -alleles vcf. So that makes sense if HC is set up not to genotype places that are labeled as "LowQual" in the FILTER column.

• Washington DCPosts: 56Member

However, I still don't understand why all sites were not output when the intervals file was a vcf/the unpadded bed file. It was not a question of different length ref alleles, because most of the sites that weren't output were SNPs.

• Washington DCPosts: 56Member

Either way I'll be able to run HC to genotype at those locations now, so thanks for your help!

• Washington DCPosts: 56Member

I have one more question about using GENOTYPE_GIVEN_ALLELES. In the documentation, you warn about increasing the max alleles argument because it will drastically slow down HC. Is this true even then you are using the given alleles argument?

It depends how you are using GGA mode. If you are passing in the vcf of alleles as intervals list as well (using -L) then the max alleles will be irrelevant because the caller will only genotype those sites with the given alleles. If you don't pass in the sites as intervals, the caller will also attempt to call all other sites in the genome normally (in addition to doing the given sites in GGA mode), and for that the max alleles are still relevant.

Geraldine Van der Auwera, PhD

• Washington DCPosts: 56Member

So, if GGA mode is meant to function with the same vcf as the intervals and alleles file, why am I not getting all sites output when I use the vcf as the intervals file?

Out of 80 valid positions in the vcf file, I got 59 positions with the vcf as the intervals file, and I got 80 positions with the padded bed file as the intervals file. (All other settings were the same and the vcf was the alleles file in both runs.) The 80 positions with the padded bed file were all variants that were in the original vcf file.

Hmm. Can you post an example of what variants were left out when using the vcf file as intevals?

Geraldine Van der Auwera, PhD

• Washington DCPosts: 56Member
edited August 2013

For example, this was in the alleles/intervals file:

1       14773   .       C       T       720.92  .       .
1       14930   rs75454623      A       G       34793.34        .       .
1       14933   rs199856693     G       A       685.76  PASS    .
1       14948   rs201855936     G       A       590.15  .       .
1       14976   rs71252251      G       A       663.76  .       .
1       15029   rs201045431     G       A       269.33  .       .

Post edited by Geraldine_VdAuwera on
• Washington DCPosts: 56Member
edited August 2013

And this was in the output vcf:

1       14773   .       C       T       631.54  .       AC=3;AF=0.250;AN=12;BaseQRankSum=-0.828;ClippingRankSum=-0.028;DP=1449;FS=12.504;MLEAC=3;MLEAF=0.250;MQ=39.59;MQ0=0;MQRankSum=4.766;QD=0.85;ReadPosRankSum=-0.150       GT:AD:DP:GQ:PL  0/1:204,43:247:99:174,0,4860    0/0:270,0:270:99:0,808,7629     0/0:133,0:133:99:0,399,3600......

1       15029   rs201045431     G       A       0       LowQual AC=0;AF=0.00;AN=12;DB;DP=946;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.30;MQ0=0 GT:AD:DP:GQ:PL  0/0:131,0:131:99:0,385,3746     0/0:213,0:213:99:0,633,6199     0/0:82,0:82:99:0,241,2293       0/0:167,0:167:99:0,494,4487     0/0:135,0:135:99:0,395,3647     0/0:218,0:218:99:0,654,6411.........

Post edited by Geraldine_VdAuwera on
• Washington DCPosts: 56Member

Sorry for the spacing. But basically of those six alleles, only two were output.

• Washington DCPosts: 56Member

This is the output in using the padded bed file as the intervals file:

1 14773 . C T 644.54 . AC=3;AF=0.250;AN=12;BaseQRankSum=-1.005;ClippingRankSum=0.337;DP=1419;FS=14.707;MLEAC=3;MLEAF=0.250;MQ=39.50;MQ0=0;MQRankSum=3.588;QD=0.90;ReadPosRankSum=-0.993 GT:AD:DP:GQ:PL 0/1:189,45:234:99:313,0,4469 0/0:270,0:270:99:0,808,7613 0/0:136,0:136:99:0,408,3695 0/1:

1 14930 rs75454623 A G 30412.34 . AC=6;AF=0.500;AN=12;BaseQRankSum=-1.491;ClippingRankSum=-2.096;DB;DP=1989;FS=37.780;MLEAC=6;MLEAF=0.500;MQ=42.43;MQ0=0;MQRankSum=-4.878;QD=15.29;ReadPosRankSum=2.487 GT:AD:DP:GQ:PL 0/1:241,59:300:99:1287,0,5974 0/1:167,215:382:99:8066,0,3769 0/1:154,55:209:99:13

1 14933 rs199856693 G A 1146.76 . AC=2;AF=0.167;AN=12;BaseQRankSum=-4.169;ClippingRankSum=-3.201;DB;DP=1992;FS=14.145;MLEAC=2;MLEAF=0.167;MQ=42.68;MQ0=0;MQRankSum=-6.500;QD=1.42;ReadPosRankSum=0.657 GT:AD:DP:GQ:PL 0/0:299,0:299:99:0,888,8528 0/0:382,0:382:99:0,1145,11499 0/0:208,0:208:99:0,620,6005

1 14948 rs201855936 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=0.615;ClippingRankSum=1.003;DB;DP=1747;FS=3.429;MLEAC=0;MLEAF=0.00;MQ=40.15;MQ0=0;MQRankSum=0.321;ReadPosRankSum=-1.497 GT:AD:DP:GQ:PL 0/0:262,0:262:99:0,833,10392 0/0:332,0:332:99:0,1244,22062 0/0:182,0:182:99:0,595,8088 0/0:333,0:33

1 14976 rs71252251 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=-1.070;ClippingRankSum=-4.000;DB;DP=1362;FS=20.859;MLEAC=0;MLEAF=0.00;MQ=37.27;MQ0=0;MQRankSum=-4.166;ReadPosRankSum=3.406 GT:AD:DP:GQ:PL 0/0:212,1:213:99:0,716,7259 0/0:265,0:265:99:0,1234,13930 0/0:136,0:136:99:0,501,5256 0/0:

1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

I'm not sure what's happening here. I'll ask @ebanks to have a look.

Geraldine Van der Auwera, PhD

• Washington DCPosts: 56Member

Thanks.

• Posts: 9Member

Hi, I'm having a similar problem running HaplotypeCaller v2.7-2-g6bda569 in GGA mode. The input VCF (-L and -alleles argument) file contains 2,186 variants, but the output VCF only contains 1,874. When I genotype with UnifiedGenotyper v2.7-2-g6bda569, the output file contains all 2,186 variants in the input file. I've listed the command lines used below.

More details:
* The output VCF does contain some variants where both of the input BAM files are genotyped as reference 0/0, and where both are called as missing ./., so these categories aren't being systematically excluded.
* At least some of the variants not being output by HaplotypeCaller have coverage
* There are no variants marked as LowQual in the input VCF.
* The input file is single-sample and all variants in it have genotype 0/1.

Any ideas on how to get HaplotypeCaller to genotype all 2,186 variants would be much appreciated.

Thanks,

Katherine

#### HaplotypeCaller

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${reference} \ -I sample1.bam \ -I sample2.bam \ -L$vcfIn \
-alleles $vcfIn \ -o output1.vcf \ -gt_mode GENOTYPE_GIVEN_ALLELES \ -stand_call_conf 0.0 \ -stand_emit_conf 0.0 #### UnifiedGenotyper java -Xmx4g -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R${reference} \
-I sample1.bam \
-I sample2.bam \
-L $vcfIn \ -alleles$vcfIn \
-o output2.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0

To find out what's going on here, I suggest you identify a couple of the variants that are disappearing when you run HC in GGA mode, then run again and use the option to generate an output BAM of what the HC reassembly step produced. Have a look at them in IGV and contrast vs. the original assembly and the original vs new calls. If the reassembly is radically changing the picture of what variation is present it might affect how the variants are output.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Hi Geraldine,

I looked at the first ten variants in the input VCF, of which only four were genotyped. I extracted these ten variants to a new VCF and then tried to genotype them using --bamOutput out.bam and --bamWriterType CALLED_HAPLOTYPES. Five of the six variants that are not genotyped have no coverage in the HC output BAM - these are all located within ~ 20 bp. The sixth variant that is not genotyped has a few reads of coverage. All have coverage in the original BAM file.

I tried a simplified experiment where I called variants for a single sample using HaplotypeCaller (restricting analysis to a single chromosome). This yielded 1290 variants (it's a low coverage sample). I then attempted to genotype the same sample at the locations of the variants just detected from that sample. Only 1075 variants were output by the genotyping step. Commands used:

#### Variant calling

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${reference} \ -I NA18507.bam \ -L chr21 \ -o NA18507.var.hc.chr21.vcf \ -stand_call_conf 20 \ -stand_emit_conf 20 #### Genotyping java -Xmx4g -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R${reference} \
-I NA18507.bam \
-L NA18507.var.hc.chr21.vcf \
-alleles NA18507.var.hc.chr21.vcf \
-o NA18507.geno.hc.chr21.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0

I would have expected all variant lines in the VCF file output by variant calling to be present in the VCF file output by genotyping. Am I missing something? Thanks for your thoughts!

Hi @KatherineS,

The issue here might be that HaplotypeCaller does not do well on single-base intervals. This is something I had not run into before, so my apologies for the lack of documentation, but for running HC in GGA mode, you actually need to pad the intervals with e.g. 150 bp on either side. You can do this automatically with the --interval_padding argument. Please try this and let me know if it solves your problem.

Geraldine Van der Auwera, PhD

• Posts: 9Member

That definitely helps! Adding -ip 150 to the genotyping step results in 1290/1290 variant lines being output for the simplified experiment. For my application of interest, it results in 2179/2186 variant lines being output. -ip 200 yields 2185/2186. The remaining variant, rs12012803, isn't genotyped even at -ip 20000.

Thank you so much for your help - I'm hugely grateful.