EMIT_ALL_SITES in HaplotypeCaller

eflynn90eflynn90 Washington DCPosts: 56Member

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Answers

  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    I'm not speaking from a great deal of experience here (this isn't my typical workflow), but it sounds like you want to use GENOTYPE_GIVEN_ALLELES (the -gt_mode and -alleles arguments).

    I can't offer a good explanation for the behavior of EMIT_ALL_SITES, except that it's very difficult to emit all potential indels - so it may just skip them entirely (or only calculate a fixed subset, or something like that)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    I agree that it sounds like you want to use GENOTYPE_GIVEN_ALLELES rather than EMIT_ALL_SITES, if the point is to get calls at the specific sites in a VCF.

    Note that starting with version 2.7, EMIT_ALL_SITES will no longer be available for HaplotypeCaller. Instead, you'll want to use the -ERC argument, which outputs sites called as reference along with a confidence score that they are actually reference.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    So, I tried changing around the variables, but I still cannot get HC to output all sites. My original interval file was a vcf with 83 positions. With the following arguments I get different outputs, but none contains all 83 positions.

    -L sites.vcf -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_notpadded.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_paddedby55bases.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 79 positions (not all overlap with the original vcf though)

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0: 26 positions

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 59 positions

    -L sites_notpadded.bed -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 66 positions

    -L sites_paddedby55bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    -L sites_paddedby110bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    I briefly looked at the bam files and there are many reads which overlap the uncalled positions.

    I am using GATK v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Ah, okay, I see one problem. For the last two runs (with padded bed files, gt given alleles, and emit all sites), the three SNPs which were not called were labeld "LowQual" in the -alleles vcf. So that makes sense if HC is set up not to genotype places that are labeled as "LowQual" in the FILTER column.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    However, I still don't understand why all sites were not output when the intervals file was a vcf/the unpadded bed file. It was not a question of different length ref alleles, because most of the sites that weren't output were SNPs.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Either way I'll be able to run HC to genotype at those locations now, so thanks for your help!

  • eflynn90eflynn90 Washington DCPosts: 56Member

    I have one more question about using GENOTYPE_GIVEN_ALLELES. In the documentation, you warn about increasing the max alleles argument because it will drastically slow down HC. Is this true even then you are using the given alleles argument?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    It depends how you are using GGA mode. If you are passing in the vcf of alleles as intervals list as well (using -L) then the max alleles will be irrelevant because the caller will only genotype those sites with the given alleles. If you don't pass in the sites as intervals, the caller will also attempt to call all other sites in the genome normally (in addition to doing the given sites in GGA mode), and for that the max alleles are still relevant.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    So, if GGA mode is meant to function with the same vcf as the intervals and alleles file, why am I not getting all sites output when I use the vcf as the intervals file?

    Out of 80 valid positions in the vcf file, I got 59 positions with the vcf as the intervals file, and I got 80 positions with the padded bed file as the intervals file. (All other settings were the same and the vcf was the alleles file in both runs.) The 80 positions with the padded bed file were all variants that were in the original vcf file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hmm. Can you post an example of what variants were left out when using the vcf file as intevals?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member
    edited August 2013

    For example, this was in the alleles/intervals file:

    1       14773   .       C       T       720.92  .       .  
    1       14930   rs75454623      A       G       34793.34        .       .  
    1       14933   rs199856693     G       A       685.76  PASS    .  
    1       14948   rs201855936     G       A       590.15  .       .
    1       14976   rs71252251      G       A       663.76  .       .
    1       15029   rs201045431     G       A       269.33  .       .
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 56Member
    edited August 2013

    And this was in the output vcf:

    1       14773   .       C       T       631.54  .       AC=3;AF=0.250;AN=12;BaseQRankSum=-0.828;ClippingRankSum=-0.028;DP=1449;FS=12.504;MLEAC=3;MLEAF=0.250;MQ=39.59;MQ0=0;MQRankSum=4.766;QD=0.85;ReadPosRankSum=-0.150       GT:AD:DP:GQ:PL  0/1:204,43:247:99:174,0,4860    0/0:270,0:270:99:0,808,7629     0/0:133,0:133:99:0,399,3600......
    
    1       15029   rs201045431     G       A       0       LowQual AC=0;AF=0.00;AN=12;DB;DP=946;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.30;MQ0=0 GT:AD:DP:GQ:PL  0/0:131,0:131:99:0,385,3746     0/0:213,0:213:99:0,633,6199     0/0:82,0:82:99:0,241,2293       0/0:167,0:167:99:0,494,4487     0/0:135,0:135:99:0,395,3647     0/0:218,0:218:99:0,654,6411.........
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCPosts: 56Member

    Sorry for the spacing. But basically of those six alleles, only two were output.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    This is the output in using the padded bed file as the intervals file:

    1 14773 . C T 644.54 . AC=3;AF=0.250;AN=12;BaseQRankSum=-1.005;ClippingRankSum=0.337;DP=1419;FS=14.707;MLEAC=3;MLEAF=0.250;MQ=39.50;MQ0=0;MQRankSum=3.588;QD=0.90;ReadPosRankSum=-0.993 GT:AD:DP:GQ:PL 0/1:189,45:234:99:313,0,4469 0/0:270,0:270:99:0,808,7613 0/0:136,0:136:99:0,408,3695 0/1:

    1 14930 rs75454623 A G 30412.34 . AC=6;AF=0.500;AN=12;BaseQRankSum=-1.491;ClippingRankSum=-2.096;DB;DP=1989;FS=37.780;MLEAC=6;MLEAF=0.500;MQ=42.43;MQ0=0;MQRankSum=-4.878;QD=15.29;ReadPosRankSum=2.487 GT:AD:DP:GQ:PL 0/1:241,59:300:99:1287,0,5974 0/1:167,215:382:99:8066,0,3769 0/1:154,55:209:99:13

    1 14933 rs199856693 G A 1146.76 . AC=2;AF=0.167;AN=12;BaseQRankSum=-4.169;ClippingRankSum=-3.201;DB;DP=1992;FS=14.145;MLEAC=2;MLEAF=0.167;MQ=42.68;MQ0=0;MQRankSum=-6.500;QD=1.42;ReadPosRankSum=0.657 GT:AD:DP:GQ:PL 0/0:299,0:299:99:0,888,8528 0/0:382,0:382:99:0,1145,11499 0/0:208,0:208:99:0,620,6005

    1 14948 rs201855936 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=0.615;ClippingRankSum=1.003;DB;DP=1747;FS=3.429;MLEAC=0;MLEAF=0.00;MQ=40.15;MQ0=0;MQRankSum=0.321;ReadPosRankSum=-1.497 GT:AD:DP:GQ:PL 0/0:262,0:262:99:0,833,10392 0/0:332,0:332:99:0,1244,22062 0/0:182,0:182:99:0,595,8088 0/0:333,0:33

    1 14976 rs71252251 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=-1.070;ClippingRankSum=-4.000;DB;DP=1362;FS=20.859;MLEAC=0;MLEAF=0.00;MQ=37.27;MQ0=0;MQRankSum=-4.166;ReadPosRankSum=3.406 GT:AD:DP:GQ:PL 0/0:212,1:213:99:0,716,7259 0/0:265,0:265:99:0,1234,13930 0/0:136,0:136:99:0,501,5256 0/0:

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    I'm not sure what's happening here. I'll ask @ebanks to have a look.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi, I'm having a similar problem running HaplotypeCaller v2.7-2-g6bda569 in GGA mode. The input VCF (-L and -alleles argument) file contains 2,186 variants, but the output VCF only contains 1,874. When I genotype with UnifiedGenotyper v2.7-2-g6bda569, the output file contains all 2,186 variants in the input file. I've listed the command lines used below.

    More details:
    * The output VCF does contain some variants where both of the input BAM files are genotyped as reference 0/0, and where both are called as missing ./., so these categories aren't being systematically excluded.
    * At least some of the variants not being output by HaplotypeCaller have coverage
    * There are no variants marked as LowQual in the input VCF.
    * The input file is single-sample and all variants in it have genotype 0/1.

    Any ideas on how to get HaplotypeCaller to genotype all 2,186 variants would be much appreciated.

    Thanks,

    Katherine

    HaplotypeCaller

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output1.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    UnifiedGenotyper

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output2.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -out_mode EMIT_ALL_SITES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    To find out what's going on here, I suggest you identify a couple of the variants that are disappearing when you run HC in GGA mode, then run again and use the option to generate an output BAM of what the HC reassembly step produced. Have a look at them in IGV and contrast vs. the original assembly and the original vs new calls. If the reassembly is radically changing the picture of what variation is present it might affect how the variants are output.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi Geraldine,

    I looked at the first ten variants in the input VCF, of which only four were genotyped. I extracted these ten variants to a new VCF and then tried to genotype them using --bamOutput out.bam and --bamWriterType CALLED_HAPLOTYPES. Five of the six variants that are not genotyped have no coverage in the HC output BAM - these are all located within ~ 20 bp. The sixth variant that is not genotyped has a few reads of coverage. All have coverage in the original BAM file.

    I tried a simplified experiment where I called variants for a single sample using HaplotypeCaller (restricting analysis to a single chromosome). This yielded 1290 variants (it's a low coverage sample). I then attempted to genotype the same sample at the locations of the variants just detected from that sample. Only 1075 variants were output by the genotyping step. Commands used:

    Variant calling

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L chr21 \
    -o NA18507.var.hc.chr21.vcf \
    -stand_call_conf 20 \
    -stand_emit_conf 20

    Genotyping

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L NA18507.var.hc.chr21.vcf \
    -alleles NA18507.var.hc.chr21.vcf \
    -o NA18507.geno.hc.chr21.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    I would have expected all variant lines in the VCF file output by variant calling to be present in the VCF file output by genotyping. Am I missing something? Thanks for your thoughts!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hi @KatherineS,

    The issue here might be that HaplotypeCaller does not do well on single-base intervals. This is something I had not run into before, so my apologies for the lack of documentation, but for running HC in GGA mode, you actually need to pad the intervals with e.g. 150 bp on either side. You can do this automatically with the --interval_padding argument. Please try this and let me know if it solves your problem.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Posts: 9Member

    Hi @Geraldine_VdAuwera,

    That definitely helps! Adding -ip 150 to the genotyping step results in 1290/1290 variant lines being output for the simplified experiment. For my application of interest, it results in 2179/2186 variant lines being output. -ip 200 yields 2185/2186. The remaining variant, rs12012803, isn't genotyped even at -ip 20000.

    Thank you so much for your help - I'm hugely grateful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Glad to hear it :)

    I'll document this as a gotcha for running HC in GGA mode.

    Geraldine Van der Auwera, PhD

  • YingYing Posts: 1Member

    Hi @Geraldine_VdAuwera‌ and @eflynn90‌ I am just wondering why HaplotypeCaller picks up the site where every sample has the 0/0 genotype? See the last line of VCF file posted by @eflynn90‌ It seems that the site is not a variant across all the samples. Any explanations, @Geraldine_VdAuwera‌ ?

    @Geraldine_VdAuwera said: I'm not sure what's happening here. I'll ask ebanks to have a look. @eflynn90 said: This is the output in using the padded bed file as the intervals file: ... 1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158 …

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hi @Ying,

    The line posted here is incomplete, maybe there are samples that are not shown that have the variant. Are you seeing something strange in your own data?

    Geraldine Van der Auwera, PhD

  • ekanterakisekanterakis UKPosts: 10Member

    Hi Geraldine, I'm using GATK (3.2-2) HaplotypeCaller and wish to emit all sites (i.e. variants and reference) to my vcf genome-wide. You suggested using the -ERC argument which takes in either BP_RESOLUTION or GVCF. What are the differences between those two options and which one gives me a confidence score for the reference positions? Has anyone tried this option genome-wide? Thanks, Stathis

  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin

    @ekanterakis

    Hello Stathis,

    BP_RESOLUTION will give you every single site (variant and non-variant) as a record, and it will contain the GQ scores.

    GVCF will give you all the sites, but the non-variant sites will be grouped as blocks. The blocks state a start and end position, and they are grouped based on GQ. However, the GQ is not stated in the GVCF. If you run GenotypeGVCFs on the GVCF, you will get the GQ.

    Because BP_RESOLUTION outputs a record for every single site, it does take a lot of time and space. GVCF mode was developed to take less time and space and is more frequently used in genome-wide analysis.

    Please have a look at this article as well: https://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila

  • ekanterakisekanterakis UKPosts: 10Member

    Thanks Sheila. I have a follow-up question. Is using BP_RESOLUTION/GVCF going to affect Variant Recalibration downstream? Should I split my variant and non-variant sites and combine them later? -Stathis

  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin

    @ekanterakis

    Hi Stathis,

    Because variant recalibration is run on SNPs separately and indels separately, you will use SelectVariants to create new vcfs of only SNPs and only indels. You will then run variant recalibration on each of those vcfs. The non-variant sites will not be included in either of the new vcfs, so you do not need to worry about them.

    -Sheila

  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin

    @ekanterakis

    Hi again Stathis,

    Please ignore my above comment if you are going to run VQSR. I was thinking about hard filtering when I wrote that.

    Because the GVCF is just an intermediate, you will run GenotypeGVCFs on all of your GVCFs, and get a file without non-variants. If you would like to retain the non-variants, you can use the --includeNonVariantSites argument. If you do use the --includeNonVariantSites argument, it doesn't matter because VQSR will just ignore non-variants, and they will be carried on to the output without modification.

    For VQSR, you do not need to run SelectVariants. You can just run all the steps serially on the same original input file, ending up with everything properly recalibrated in one output file (after the half-recalibrated intermediate). Please read more about VQSR workflow here: https://www.broadinstitute.org/gatk/guide/article?id=2805

    -Sheila

  • ekanterakisekanterakis UKPosts: 10Member

    Hi Sheila, I'm running HaplotypeCaller by chromosome and chunking (50Mb at a time). I'm using -ERC BP_RESOLUTION to get every single position. I would then like to run VariantRecalibrator -mode BOTH (can this be done on each of the chunks separately??) and then ApplyRecalibration (this needs to be done separately for SNPs and Indels, right?). Is this a good approach? Where will my reference positions go after ApplyRecalibration? Will they be in the SNPs file? If I then CombineVariants for all chunks of all chromosomes do I get a complete, recalibrated genome vcf? Thank you again for your help. -Stathis

  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin

    @ekanterakis

    Hi Stathis,

    You cannot use -BOTH for VQSR at any point, nor can you chunk VQSR jobs.

    You are correct about the ApplyRecalibration step being done separately on SNPs and indels. Your reference positions will be in the final vcf file, but they will not be recalibrated n the process.

    -Sheila

  • FerFer AustriaPosts: 16Member

    Hi, this is my first post in the blog, and it comes only after trying my best to look for a solution to my problem. I'm also using HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode, and I want it to EMIT_ALL_SITES that I've passed it cause it is important for downstream scripts (before used to run UG and got what I wanted, but now I want the benefits of HC). Even after playing around with --interval_padding parameter I still don't get all sites.
    For couple sites that were not emitted I've checked the bam files, and [no surprise] there were no reads spanning the region (I'm working with very low coverage ~1x data for hundreds of individuals). Last thing I've tried was to set to 0 -minReadsPerAlignStart just in case, but it was also unsuccessful. What can I do to get ALL passed sites even if they come as ./. as many others already do? Alberto

  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin

    @Fer

    Hi Alberto,

    Can you please provide us with what version of GATK you are using and the exact command line you used? Also, can you post some example lines from the VCF showing missing positions?

    Thanks, Sheila

  • FerFer AustriaPosts: 16Member
    edited November 7

    Thanks Sheila, I've tried with both versions 3.2 and 3.3.

    java -Xmx6g -jar GenomeAnalysisTK.jar / -R $ref / -T HaplotypeCaller / -I $acc.sort.mkdup.realigned.bam / -o $acc.hc.genotyped.vcf / --genotyping_mode GENOTYPE_GIVEN_ALLELES / -out_mode EMIT_ALL_SITES / -stand_call_conf 0.0 / -stand_emit_conf 0.0 / --interval_padding 200 / -minReadsPerAlignStart 0 / --alleles %s.%s.hc.snp.vcf / -L %s.%s.hc.snp.vcf

    I don't know if this is the example you're asking me for. It comes from the %s.%s.hc.snp.vcf file that provides the alleles of interest. Positions 1420 and 3102 get into the output file, while 2309 and 2310 don't.

    Chr1 1420 . T A 39.17 . AC=2;AF=0.500;AN=4;DP=11;FS=0.000;GQ_MEAN=31.69;GQ_STDDEV=22.03;InbreedingCoeff=0.8131;MQ=60.00;MQ0=0;NCC=0;QD=13.06 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,232 1/1:0,3:3:9:82,9,0

    Chr1 2309 . T A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=-1.628e+00;ClippingRankSum=0.158;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=1.63;NCC=1;QD=26.88;ReadPosRankSum=1.31 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 2310 . C A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=1.10;ClippingRankSum=0.683;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=-1.580e-01;NCC=1;QD=28.92;ReadPosRankSum=1.42 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 3102 . A G 2245.11 . AC=4;AF=1.00;AN=4;DP=13;FS=0.000;GQ_MEAN=29.69;GQ_STDDEV=17.70;InbreedingCoeff=0.9221;MQ=60.00;MQ0=0;NCC=0;QD=27.05 GT:AD:DP:GQ:PL 1/1:0,11:11:33:320,33,0 1/1:0,2:2:6:57,6,0

    Post edited by Fer on
  • SheilaSheila Broad InstitutePosts: 757Member, GATK Developer, Broadie, Moderator admin
    edited November 7

    @Fer

    Hi Alberto,

    The issue is that Haplotype Caller is not compatible with -out_mode EMIT_ALL_SITES. You must use -ERC BP_RESOLUTION instead. This will give you an intermediate GVCF which you will need to run in GenotypeGVCFs to give you the final vcf. Sorry for the confusion. I will fix this in the documentation asap.

    Please read more about -ERC BP_RESOLUTION and intermediate GVCFs here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--emitRefConfidence

    https://www.broadinstitute.org/gatk/guide/article?id=4017

    Once you have the output GVCF from running HaplotypeCaller with -ERC BP_RESOLUTION, to get the regular vcf with all sites, you will run GenotypeGVCFs with -allSites. Please read about it here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php#--includeNonVariantSites

    I hope this helps.

    -Sheila

    Post edited by Sheila on
  • FerFer AustriaPosts: 16Member

    Thanks, it helped to know. However, I was able to find my way around to get what I originally wanted: the output of GENOTYPE_GIVEN_ALLELES emitting all sites present in the vcf provided in --alleles parameter. I just had to run Combine variants afterwards. Alberto

Sign In or Register to comment.