EMIT_ALL_SITES in HaplotypeCaller

eflynn90eflynn90 Washington DCPosts: 56Member

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Answers

  • pdexheimerpdexheimer Posts: 483Member, GATK Dev, DSDE Dev mod

    I'm not speaking from a great deal of experience here (this isn't my typical workflow), but it sounds like you want to use GENOTYPE_GIVEN_ALLELES (the -gt_mode and -alleles arguments).

    I can't offer a good explanation for the behavior of EMIT_ALL_SITES, except that it's very difficult to emit all potential indels - so it may just skip them entirely (or only calculate a fixed subset, or something like that)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    I agree that it sounds like you want to use GENOTYPE_GIVEN_ALLELES rather than EMIT_ALL_SITES, if the point is to get calls at the specific sites in a VCF.

    Note that starting with version 2.7, EMIT_ALL_SITES will no longer be available for HaplotypeCaller. Instead, you'll want to use the -ERC argument, which outputs sites called as reference along with a confidence score that they are actually reference.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    So, I tried changing around the variables, but I still cannot get HC to output all sites. My original interval file was a vcf with 83 positions. With the following arguments I get different outputs, but none contains all 83 positions.

    -L sites.vcf -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_notpadded.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_paddedby55bases.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 79 positions (not all overlap with the original vcf though)

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0: 26 positions

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 59 positions

    -L sites_notpadded.bed -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 66 positions

    -L sites_paddedby55bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    -L sites_paddedby110bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    I briefly looked at the bam files and there are many reads which overlap the uncalled positions.

    I am using GATK v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Ah, okay, I see one problem. For the last two runs (with padded bed files, gt given alleles, and emit all sites), the three SNPs which were not called were labeld "LowQual" in the -alleles vcf. So that makes sense if HC is set up not to genotype places that are labeled as "LowQual" in the FILTER column.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    However, I still don't understand why all sites were not output when the intervals file was a vcf/the unpadded bed file. It was not a question of different length ref alleles, because most of the sites that weren't output were SNPs.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Either way I'll be able to run HC to genotype at those locations now, so thanks for your help!

  • eflynn90eflynn90 Washington DCPosts: 56Member

    I have one more question about using GENOTYPE_GIVEN_ALLELES. In the documentation, you warn about increasing the max alleles argument because it will drastically slow down HC. Is this true even then you are using the given alleles argument?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    It depends how you are using GGA mode. If you are passing in the vcf of alleles as intervals list as well (using -L) then the max alleles will be irrelevant because the caller will only genotype those sites with the given alleles. If you don't pass in the sites as intervals, the caller will also attempt to call all other sites in the genome normally (in addition to doing the given sites in GGA mode), and for that the max alleles are still relevant.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    So, if GGA mode is meant to function with the same vcf as the intervals and alleles file, why am I not getting all sites output when I use the vcf as the intervals file?

    Out of 80 valid positions in the vcf file, I got 59 positions with the vcf as the intervals file, and I got 80 positions with the padded bed file as the intervals file. (All other settings were the same and the vcf was the alleles file in both runs.) The 80 positions with the padded bed file were all variants that were in the original vcf file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    Hmm. Can you post an example of what variants were left out when using the vcf file as intevals?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member
    edited August 2013

    For example, this was in the alleles/intervals file:

    1       14773   .       C       T       720.92  .       .  
    1       14930   rs75454623      A       G       34793.34        .       .  
    1       14933   rs199856693     G       A       685.76  PASS    .  
    1       14948   rs201855936     G       A       590.15  .       .
    1       14976   rs71252251      G       A       663.76  .       .
    1       15029   rs201045431     G       A       269.33  .       .
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 56Member
    edited August 2013

    And this was in the output vcf:

    1       14773   .       C       T       631.54  .       AC=3;AF=0.250;AN=12;BaseQRankSum=-0.828;ClippingRankSum=-0.028;DP=1449;FS=12.504;MLEAC=3;MLEAF=0.250;MQ=39.59;MQ0=0;MQRankSum=4.766;QD=0.85;ReadPosRankSum=-0.150       GT:AD:DP:GQ:PL  0/1:204,43:247:99:174,0,4860    0/0:270,0:270:99:0,808,7629     0/0:133,0:133:99:0,399,3600......
    
    1       15029   rs201045431     G       A       0       LowQual AC=0;AF=0.00;AN=12;DB;DP=946;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.30;MQ0=0 GT:AD:DP:GQ:PL  0/0:131,0:131:99:0,385,3746     0/0:213,0:213:99:0,633,6199     0/0:82,0:82:99:0,241,2293       0/0:167,0:167:99:0,494,4487     0/0:135,0:135:99:0,395,3647     0/0:218,0:218:99:0,654,6411.........
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 56Member

    Sorry for the spacing. But basically of those six alleles, only two were output.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    This is the output in using the padded bed file as the intervals file:

    1 14773 . C T 644.54 . AC=3;AF=0.250;AN=12;BaseQRankSum=-1.005;ClippingRankSum=0.337;DP=1419;FS=14.707;MLEAC=3;MLEAF=0.250;MQ=39.50;MQ0=0;MQRankSum=3.588;QD=0.90;ReadPosRankSum=-0.993 GT:AD:DP:GQ:PL 0/1:189,45:234:99:313,0,4469 0/0:270,0:270:99:0,808,7613 0/0:136,0:136:99:0,408,3695 0/1:

    1 14930 rs75454623 A G 30412.34 . AC=6;AF=0.500;AN=12;BaseQRankSum=-1.491;ClippingRankSum=-2.096;DB;DP=1989;FS=37.780;MLEAC=6;MLEAF=0.500;MQ=42.43;MQ0=0;MQRankSum=-4.878;QD=15.29;ReadPosRankSum=2.487 GT:AD:DP:GQ:PL 0/1:241,59:300:99:1287,0,5974 0/1:167,215:382:99:8066,0,3769 0/1:154,55:209:99:13

    1 14933 rs199856693 G A 1146.76 . AC=2;AF=0.167;AN=12;BaseQRankSum=-4.169;ClippingRankSum=-3.201;DB;DP=1992;FS=14.145;MLEAC=2;MLEAF=0.167;MQ=42.68;MQ0=0;MQRankSum=-6.500;QD=1.42;ReadPosRankSum=0.657 GT:AD:DP:GQ:PL 0/0:299,0:299:99:0,888,8528 0/0:382,0:382:99:0,1145,11499 0/0:208,0:208:99:0,620,6005

    1 14948 rs201855936 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=0.615;ClippingRankSum=1.003;DB;DP=1747;FS=3.429;MLEAC=0;MLEAF=0.00;MQ=40.15;MQ0=0;MQRankSum=0.321;ReadPosRankSum=-1.497 GT:AD:DP:GQ:PL 0/0:262,0:262:99:0,833,10392 0/0:332,0:332:99:0,1244,22062 0/0:182,0:182:99:0,595,8088 0/0:333,0:33

    1 14976 rs71252251 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=-1.070;ClippingRankSum=-4.000;DB;DP=1362;FS=20.859;MLEAC=0;MLEAF=0.00;MQ=37.27;MQ0=0;MQRankSum=-4.166;ReadPosRankSum=3.406 GT:AD:DP:GQ:PL 0/0:212,1:213:99:0,716,7259 0/0:265,0:265:99:0,1234,13930 0/0:136,0:136:99:0,501,5256 0/0:

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    I'm not sure what's happening here. I'll ask @ebanks to have a look.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Thanks.

  • KatherineSKatherineS Posts: 9Member

    Hi, I'm having a similar problem running HaplotypeCaller v2.7-2-g6bda569 in GGA mode. The input VCF (-L and -alleles argument) file contains 2,186 variants, but the output VCF only contains 1,874. When I genotype with UnifiedGenotyper v2.7-2-g6bda569, the output file contains all 2,186 variants in the input file. I've listed the command lines used below.

    More details:
    * The output VCF does contain some variants where both of the input BAM files are genotyped as reference 0/0, and where both are called as missing ./., so these categories aren't being systematically excluded.
    * At least some of the variants not being output by HaplotypeCaller have coverage
    * There are no variants marked as LowQual in the input VCF.
    * The input file is single-sample and all variants in it have genotype 0/1.

    Any ideas on how to get HaplotypeCaller to genotype all 2,186 variants would be much appreciated.

    Thanks,

    Katherine

    HaplotypeCaller

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output1.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    UnifiedGenotyper

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output2.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -out_mode EMIT_ALL_SITES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    To find out what's going on here, I suggest you identify a couple of the variants that are disappearing when you run HC in GGA mode, then run again and use the option to generate an output BAM of what the HC reassembly step produced. Have a look at them in IGV and contrast vs. the original assembly and the original vs new calls. If the reassembly is radically changing the picture of what variation is present it might affect how the variants are output.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi Geraldine,

    I looked at the first ten variants in the input VCF, of which only four were genotyped. I extracted these ten variants to a new VCF and then tried to genotype them using --bamOutput out.bam and --bamWriterType CALLED_HAPLOTYPES. Five of the six variants that are not genotyped have no coverage in the HC output BAM - these are all located within ~ 20 bp. The sixth variant that is not genotyped has a few reads of coverage. All have coverage in the original BAM file.

    I tried a simplified experiment where I called variants for a single sample using HaplotypeCaller (restricting analysis to a single chromosome). This yielded 1290 variants (it's a low coverage sample). I then attempted to genotype the same sample at the locations of the variants just detected from that sample. Only 1075 variants were output by the genotyping step. Commands used:

    Variant calling

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L chr21 \
    -o NA18507.var.hc.chr21.vcf \
    -stand_call_conf 20 \
    -stand_emit_conf 20

    Genotyping

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L NA18507.var.hc.chr21.vcf \
    -alleles NA18507.var.hc.chr21.vcf \
    -o NA18507.geno.hc.chr21.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    I would have expected all variant lines in the VCF file output by variant calling to be present in the VCF file output by genotyping. Am I missing something? Thanks for your thoughts!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    Hi @KatherineS,

    The issue here might be that HaplotypeCaller does not do well on single-base intervals. This is something I had not run into before, so my apologies for the lack of documentation, but for running HC in GGA mode, you actually need to pad the intervals with e.g. 150 bp on either side. You can do this automatically with the --interval_padding argument. Please try this and let me know if it solves your problem.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi @Geraldine_VdAuwera,

    That definitely helps! Adding -ip 150 to the genotyping step results in 1290/1290 variant lines being output for the simplified experiment. For my application of interest, it results in 2179/2186 variant lines being output. -ip 200 yields 2185/2186. The remaining variant, rs12012803, isn't genotyped even at -ip 20000.

    Thank you so much for your help - I'm hugely grateful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    Glad to hear it :)

    I'll document this as a gotcha for running HC in GGA mode.

    Geraldine Van der Auwera, PhD

  • YingYing Posts: 1Member

    Hi @Geraldine_VdAuwera‌ and @eflynn90‌ I am just wondering why HaplotypeCaller picks up the site where every sample has the 0/0 genotype? See the last line of VCF file posted by @eflynn90‌ It seems that the site is not a variant across all the samples. Any explanations, @Geraldine_VdAuwera‌ ?

    @Geraldine_VdAuwera said:
    I'm not sure what's happening here. I'll ask ebanks to have a look.
    @eflynn90 said:
    This is the output in using the padded bed file as the intervals file:

    ...

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,171Administrator, GATK Dev admin

    Hi @Ying,

    The line posted here is incomplete, maybe there are samples that are not shown that have the variant. Are you seeing something strange in your own data?

    Geraldine Van der Auwera, PhD

  • ekanterakisekanterakis UKPosts: 12Member

    Hi Geraldine,
    I'm using GATK (3.2-2) HaplotypeCaller and wish to emit all sites (i.e. variants and reference) to my vcf genome-wide. You suggested using the -ERC argument which takes in either BP_RESOLUTION or GVCF. What are the differences between those two options and which one gives me a confidence score for the reference positions? Has anyone tried this option genome-wide?
    Thanks,
    Stathis

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @ekanterakis

    Hello Stathis,

    BP_RESOLUTION will give you every single site (variant and non-variant) as a record, and it will contain the GQ scores.

    GVCF will give you all the sites, but the non-variant sites will be grouped as blocks. The blocks state a start and end position, and they are grouped based on GQ. However, the GQ is not stated in the GVCF. If you run GenotypeGVCFs on the GVCF, you will get the GQ.

    Because BP_RESOLUTION outputs a record for every single site, it does take a lot of time and space. GVCF mode was developed to take less time and space and is more frequently used in genome-wide analysis.

    Please have a look at this article as well: https://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila

  • ekanterakisekanterakis UKPosts: 12Member

    Thanks Sheila. I have a follow-up question. Is using BP_RESOLUTION/GVCF going to affect Variant Recalibration downstream? Should I split my variant and non-variant sites and combine them later?
    -Stathis

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @ekanterakis

    Hi Stathis,

    Because variant recalibration is run on SNPs separately and indels separately, you will use SelectVariants to create new vcfs of only SNPs and only indels. You will then run variant recalibration on each of those vcfs. The non-variant sites will not be included in either of the new vcfs, so you do not need to worry about them.

    -Sheila

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @ekanterakis

    Hi again Stathis,

    Please ignore my above comment if you are going to run VQSR. I was thinking about hard filtering when I wrote that.

    Because the GVCF is just an intermediate, you will run GenotypeGVCFs on all of your GVCFs, and get a file without non-variants. If you would like to retain the non-variants, you can use the
    --includeNonVariantSites argument. If you do use the --includeNonVariantSites argument, it doesn't matter because VQSR will just ignore non-variants, and they will be carried on to the output without modification.

    For VQSR, you do not need to run SelectVariants. You can just run all the steps serially on the same original input file, ending up with everything properly recalibrated in one output file (after the half-recalibrated intermediate). Please read more about VQSR workflow here: https://www.broadinstitute.org/gatk/guide/article?id=2805

    -Sheila

  • ekanterakisekanterakis UKPosts: 12Member

    Hi Sheila, I'm running HaplotypeCaller by chromosome and chunking (50Mb at a time). I'm using -ERC BP_RESOLUTION to get every single position. I would then like to run VariantRecalibrator -mode BOTH (can this be done on each of the chunks separately??) and then ApplyRecalibration (this needs to be done separately for SNPs and Indels, right?). Is this a good approach? Where will my reference positions go after ApplyRecalibration? Will they be in the SNPs file? If I then CombineVariants for all chunks of all chromosomes do I get a complete, recalibrated genome vcf?
    Thank you again for your help.
    -Stathis

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @ekanterakis

    Hi Stathis,

    You cannot use -BOTH for VQSR at any point, nor can you chunk VQSR jobs.

    You are correct about the ApplyRecalibration step being done separately on SNPs and indels. Your reference positions will be in the final vcf file, but they will not be recalibrated n the process.

    -Sheila

  • FerFer AustriaPosts: 22Member

    Hi, this is my first post in the blog, and it comes only after trying my best to look for a solution to my problem.
    I'm also using HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode, and I want it to EMIT_ALL_SITES that I've passed it cause it is important for downstream scripts (before used to run UG and got what I wanted, but now I want the benefits of HC). Even after playing around with --interval_padding parameter I still don't get all sites.
    For couple sites that were not emitted I've checked the bam files, and [no surprise] there were no reads spanning the region (I'm working with very low coverage ~1x data for hundreds of individuals).
    Last thing I've tried was to set to 0 -minReadsPerAlignStart just in case, but it was also unsuccessful. What can I do to get ALL passed sites even if they come as ./. as many others already do?
    Alberto

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @Fer

    Hi Alberto,

    Can you please provide us with what version of GATK you are using and the exact command line you used? Also, can you post some example lines from the VCF showing missing positions?

    Thanks,
    Sheila

  • FerFer AustriaPosts: 22Member
    edited November 2014

    Thanks Sheila,
    I've tried with both versions 3.2 and 3.3.

    java -Xmx6g -jar GenomeAnalysisTK.jar /
    -R $ref /
    -T HaplotypeCaller /
    -I $acc.sort.mkdup.realigned.bam /
    -o $acc.hc.genotyped.vcf /
    --genotyping_mode GENOTYPE_GIVEN_ALLELES /
    -out_mode EMIT_ALL_SITES /
    -stand_call_conf 0.0 /
    -stand_emit_conf 0.0 /
    --interval_padding 200 /
    -minReadsPerAlignStart 0 /
    --alleles %s.%s.hc.snp.vcf /
    -L %s.%s.hc.snp.vcf

    I don't know if this is the example you're asking me for. It comes from the %s.%s.hc.snp.vcf file that provides the alleles of interest. Positions 1420 and 3102 get into the output file, while 2309 and 2310 don't.

    Chr1 1420 . T A 39.17 . AC=2;AF=0.500;AN=4;DP=11;FS=0.000;GQ_MEAN=31.69;GQ_STDDEV=22.03;InbreedingCoeff=0.8131;MQ=60.00;MQ0=0;NCC=0;QD=13.06 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,232 1/1:0,3:3:9:82,9,0

    Chr1 2309 . T A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=-1.628e+00;ClippingRankSum=0.158;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=1.63;NCC=1;QD=26.88;ReadPosRankSum=1.31 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 2310 . C A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=1.10;ClippingRankSum=0.683;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=-1.580e-01;NCC=1;QD=28.92;ReadPosRankSum=1.42 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 3102 . A G 2245.11 . AC=4;AF=1.00;AN=4;DP=13;FS=0.000;GQ_MEAN=29.69;GQ_STDDEV=17.70;InbreedingCoeff=0.9221;MQ=60.00;MQ0=0;NCC=0;QD=27.05 GT:AD:DP:GQ:PL 1/1:0,11:11:33:320,33,0 1/1:0,2:2:6:57,6,0

    Post edited by Fer on
  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin
    edited November 2014

    @Fer

    Hi Alberto,

    The issue is that Haplotype Caller is not compatible with -out_mode EMIT_ALL_SITES. You must use -ERC BP_RESOLUTION instead. This will give you an intermediate GVCF which you will need to run in GenotypeGVCFs to give you the final vcf. Sorry for the confusion. I will fix this in the documentation asap.

    Please read more about -ERC BP_RESOLUTION and intermediate GVCFs here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--emitRefConfidence

    https://www.broadinstitute.org/gatk/guide/article?id=4017

    Once you have the output GVCF from running HaplotypeCaller with -ERC BP_RESOLUTION, to get the regular vcf with all sites, you will run GenotypeGVCFs with -allSites. Please read about it here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php#--includeNonVariantSites

    I hope this helps.

    -Sheila

    Post edited by Sheila on
  • FerFer AustriaPosts: 22Member

    Thanks, it helped to know. However, I was able to find my way around to get what I originally wanted: the output of GENOTYPE_GIVEN_ALLELES emitting all sites present in the vcf provided in --alleles parameter. I just had to run Combine variants afterwards.
    Alberto

  • tinutinu Posts: 44Member

    I am having issue GATK UnifiedGenotyper in GENOTYPE GIVEN ALLELES mode

     java -Xmx4g -jar GATK-3.3/GenomeAnalysisTK.jar -T UnifiedGenotyper \
     -R  hs37d5_chr.fa -glm BOTH  
    --genotyping_mode GENOTYPE_GIVEN_ALLELES  \
    -alleles:VCF diff-chr21_sort.vcf-L diff-chr21_sort.vcf \
    --output_mode EMIT_ALL_SITES -U LENIENT_VCF_PROCESSING \
    -I sample1.bam -o diff-chr21_sort.gga.vcf
    

    My input file (-alleles) file has 3672 variants.
    But GGA called VCF just had 2583 variants.

    Took the variants not in the outputfile (3672 - 2583) and ran UnifiedGenotyper in GGA mode and saw that none of those variants were outputted and these were the GATK messages

     16:18:53,014 HelpFormatter - Date/Time: 2015/05/08 16:18:52
    INFO  16:18:53,015 HelpFormatter - --------------------------------------------------------------------------------
    INFO  16:18:53,016 HelpFormatter - --------------------------------------------------------------------------------
    INFO  16:18:53,428 GenomeAnalysisEngine - Strictness is SILENT
    INFO  16:18:53,651 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
    INFO  16:18:53,664 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO  16:18:53,732 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
    INFO  16:18:54,356 IntervalUtils - Processing 1494 bp from intervals
    INFO  16:18:54,454 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO  16:18:54,603 GenomeAnalysisEngine - Done preparing for traversal
    INFO  16:18:54,604 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO  16:18:54,604 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
    INFO  16:18:54,605 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
    INFO  16:18:54,653 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO  16:18:54,653 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO  16:18:57,585 ProgressMeter -            done     74850.0     2.0 s      39.0 s       99.9%     2.0 s       0.0 s
    INFO  16:18:57,585 ProgressMeter - Total runtime 2.98 secs, 0.05 min, 0.00 hours
    INFO  16:18:57,590 MicroScheduler - 109 reads were filtered out during the traversal out of approximately 4554 total reads (2.39%)
    INFO  16:18:57,590 MicroScheduler -   -> 93 reads (2.04% of total) failing BadMateFilter
    INFO  16:18:57,591 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO  16:18:57,591 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO  16:18:57,592 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO  16:18:57,592 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO  16:18:57,593 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO  16:18:57,593 MicroScheduler -   -> 16 reads (0.35% of total) failing UnmappedReadFilter
    WARN  16:18:59,117 RestStorageService - Error Response: PUT '/qheODGuSqbELnioHLUk9v353xAJ1rwcd.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 397, Content-MD5: 2p38VGbP3uPfO8rDLw39kA==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: da9dfc5466cfdee3df3bcac32f0dfd90, Date: Fri, 08 May 2015 21:18:58 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:+qZ2PaFrK/RE6siH04K2xWRV2fs=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-431.5.1.el6.x86_64; amd64; en; JVM 1.7.0_51), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: A5F6D5659C08C440, x-amz-id-2: mBCoflGPYIDx+QRSvK4rwnn84Rp5zx3B7sxooBsNqtr4j72R+HbJTAZlyvFT5DkU7EiAHbrqrFo=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Fri, 08 May 2015 22:07:17 GMT, Connection: close, Server: AmazonS3]
    WARN  16:18:59,179 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 2897 seconds. Retrying connection.
    INFO  16:18:59,389 GATKRunReport - Uploaded run statistics report to AWS S3
    

    I had previously used Unified Genotyper and it did give me all variants in the output file. Not sure what is happening now.

  • tommycarstensentommycarstensen United KingdomPosts: 336Member ✭✭✭

    @tinu It should not matter, but can you try with -stand_call_conf 0 and -stand_emit_conf 0? Any reason you are using -U LENIENT_VCF_PROCESSING? Can you print an example of a variant, which fails to be called?

  • SheilaSheila Broad InstitutePosts: 1,663Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @tinu
    Hi,

    This thread may help: http://gatkforums.broadinstitute.org/discussion/5520/is-there-a-conflict-between-emit-all-confident-sites-and-genotype-given-alleles#latest

    Also, Tommy @tommycarstensen is correct about the vcf maybe having errors. Please try validating your vcf.

    Thanks,
    Sheila

  • tinutinu Posts: 44Member

    @tommycarstensen I tried using both-stand_call_conf 0 and -stand_emit_conf 0. Still the same issue, not all variants given as -alleles is there in the GGA output.

    I am pasting few variants which were not present in the output, but were part of the input file

    chr21   10906855        .       G       A       167.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.271e+00;CCC=75212;ClippingRankSum=2.20;DP=1989849;FS=0.000;GQ_MEAN=89.17;GQ_STDDEV=27.18;HWP=1.0000;InbreedingCoeff=0.0004;MLEAC=1;MLEAF=1.330e-05;MQ=53.65;MQ0=0;MQRankSum=-3.030e+00;NCC=1;NEGATIVE_TRAIN_SITE;QD=3.21;ReadPosRankSum=-2.096e+00;VQSLOD=-6.788e+00;culprit=QD;set=FilteredInAll
    chr21   10906856        .       G       T       73.02   VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.699e+00;CCC=75212;ClippingRankSum=-5.190e-01;DP=1989828;FS=2.374;GQ_MEAN=89.22;GQ_STDDEV=27.09;HWP=1.0000;InbreedingCoeff=0.0008;MLEAC=1;MLEAF=1.330e-05;MQ=58.36;MQ0=0;MQRankSum=0.411;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.12;ReadPosRankSum=-3.040e-01;VQSLOD=-2.658e+00;culprit=QD;set=FilteredInAll
    chr21   10906858        .       T       G,C     14393.03        VQSRTrancheSNP99.60to99.80      AC=164,1;AF=2.181e-03,1.330e-05;AN=75212;BaseQRankSum=-5.703e+00;CCC=75212;ClippingRankSum=2.51;DP=1952326;FS=17.557;GQ_MEAN=85.70;GQ_STDDEV=30.69;HWP=1.0000;InbreedingCoeff=0.0003;MLEAC=144,1;MLEAF=1.915e-03,1.330e-05;MQ=54.45;MQ0=0;MQRankSum=2.53;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.11;ReadPosRankSum=1.19;VQSLOD=-6.531e+00;culprit=FS;set=FilteredInAll
    chr21   10906864        .       A       C       361.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=0.825;CCC=75212;ClippingRankSum=1.22;DP=1945929;FS=3.564;GQ_MEAN=88.71;GQ_STDDEV=26.94;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1;MLEAF=1.330e-05;MQ=57.04;MQ0=0;MQRankSum=0.332;NCC=1;NEGATIVE_TRAIN_SITE;QD=2.54;ReadPosRankSum=0.086;VQSLOD=-1.969e+00;culprit=QD;set=FilteredInAll
    chr21   10906865        .       G       C       160.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.920e+00;CCC=75212;ClippingRankSum=-1.220e-01;DP=1946148;FS=0.000;GQ_MEAN=88.69;GQ_STDDEV=26.91;HWP=1.0000;InbreedingCoeff=0.0007;MLEAC=1;MLEAF=1.330e-05;MQ=58.27;MQ0=0;MQRankSum=-2.420e-01;NCC=1;NEGATIVE_TRAIN_SITE;QD=0.96;ReadPosRankSum=-5.970e-01;VQSLOD=-3.119e+00;culprit=QD;set=FilteredInAll
    chr21   10906866        .       A       T       306.28  VQSRTrancheSNP99.60to99.80      AC=2;AF=2.659e-05;AN=75212;BaseQRankSum=-5.560e-01;CCC=75212;ClippingRankSum=0.169;DP=1946211;FS=0.000;GQ_MEAN=88.73;GQ_STDDEV=26.87;HWP=1.0000;InbreedingCoeff=0.0008;MLEAC=2;MLEAF=2.659e-05;MQ=60.00;MQ0=0;MQRankSum=2.40;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.76;ReadPosRankSum=0.960;VQSLOD=-2.343e+00;culprit=QD;set=FilteredInAll
    chr21   10906868        .       CATA    C,TATA  233.71  VQSRTrancheINDEL99.00to99.50    AC=1,1;AF=1.330e-05,1.330e-05;AN=75212;BaseQRankSum=-1.123e+00;CCC=75212;ClippingRankSum=-1.058e+00;DP=1946484;FS=1.668;GQ_MEAN=88.72;GQ_STDDEV=26.87;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1,1;MLEAF=1.330e-05,1.330e-05;MQ=56.33;MQ0=0;MQRankSum=0.770;NCC=1;QD=2.57;ReadPosRankSum=0.963;VQSLOD=-5.935e-01;culprit=QD;set=FilteredInAll
    chr21   10906870        .       T       C       667.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=1.63;CCC=75212;ClippingRankSum=-5.080e-01;DP=1946781;FS=0.000;GQ_MEAN=88.76;GQ_STDDEV=27.02;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1;MLEAF=1.330e-05;MQ=45.95;MQ0=0;MQRankSum=0.322;NCC=1;NEGATIVE_TRAIN_SITE;QD=5.02;ReadPosRankSum=1.19;VQSLOD=-1.779e+00;culprit=QD;set=FilteredInAll
    chr21   10906872        .       A       G       531.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-3.022e+00;CCC=75212;ClippingRankSum=-1.850e-01;DP=1947078;FS=0.000;GQ_MEAN=88.77;GQ_STDDEV=26.92;HWP=1.0000;InbreedingCoeff=0.0007;MLEAC=1;MLEAF=1.330e-05;MQ=59.56;MQ0=0;MQRankSum=-1.247e+00;NCC=1;NEGATIVE_TRAIN_SITE;QD=2.75;ReadPosRankSum=2.07;VQSLOD=-3.958e+00;culprit=QD;set=FilteredInAll
    chr21   10906875        .       CT      AT,TT,GT,C      3201.09 VQSRTrancheINDEL99.00to99.50    AC=10,1,1,1;AF=1.330e-04,1.330e-05,1.330e-05,1.330e-05;AN=75212;BaseQRankSum=-1.916e+00;CCC=75212;ClippingRankSum=-6.650e-01;DP=1947549;FS=0.880;GQ_MEAN=88.83;GQ_STDDEV=27.25;HWP=1.0000;InbreedingCoeff=0.0003;MLEAC=10,1,1,1;MLEAF=1.330e-04,1.330e-05,1.330e-05,1.330e-05;MQ=58.37;MQ0=0;MQRankSum=0.371;NCC=1;QD=2.83;ReadPosRankSum=1.06;VQSLOD=-6.668e-01;culprit=InbreedingCoeff;set=FilteredInAll
    
  • pdexheimerpdexheimer Posts: 483Member, GATK Dev, DSDE Dev mod

    Aha! They're all filtered, so the engine never sees them to genotype those alleles. I don't believe there's an option to ignore filters on -alleles (but check, there might be!), but there is an option to strip filters in SelectVariants - so you should be able to remove the filters first, and then send it UG

  • tinutinu Posts: 44Member
    edited May 11

    @pdexheimer I tried to empty the INFO field by putting the value of INFO field as '.' and still those variants were not in the output.

    chr21   10906855        .       G       A       167.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906856        .       G       T       73.02   VQSRTrancheSNP99.60to99.80      .
    chr21   10906858        .       T       G,C     14393.03        VQSRTrancheSNP99.60to99.80      .
    chr21   10906864        .       A       C       361.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906865        .       G       C       160.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906866        .       A       T       306.28  VQSRTrancheSNP99.60to99.80      .
    chr21   10906868        .       CATA    C,TATA  233.71  VQSRTrancheINDEL99.00to99.50    .
    chr21   10906870        .       T       C       667.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906872        .       A       G       531.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906875        .       CT      AT,TT,GT,C      3201.09 VQSRTrancheINDEL99.00to99.50    .
    
    Post edited by tinu on
  • pdexheimerpdexheimer Posts: 483Member, GATK Dev, DSDE Dev mod

    Well, that makes sense. The INFO field has nothing to do with filters. Try clearing out the FILTER column

  • tinutinu Posts: 44Member

    @Sheila According to your thread, if I give just Genotype_Given_Alleles, I just get 254 variants input and
    with EMIT_ALL_CONFIDENT_SITES the output has 2583 variants
    with EMIT_ALl_SITES, the ouput has 2533 variants

    My input has 3672 variants

  • tinutinu Posts: 44Member

    @pdexheimer OK so now I made the FILTER and INFO field as '.' and tried running again and now my input and output have the same number of variants.

    Thanks for helping out.

    Could you tell why does the FILTER field values affect the GENOTYPE_GIVEN_ALLELE mode of variant calling

  • pdexheimerpdexheimer Posts: 483Member, GATK Dev, DSDE Dev mod

    By default, the engine ignores filtered variants in vcfs. After all, if they were confident sites, they wouldn't be filtered! Most of the walkers that take vcfs as input have an option to ignore the filters, but I don't think such an option exists for the interval processing unit

Sign In or Register to comment.