EMIT_ALL_SITES in HaplotypeCaller

eflynn90eflynn90 Washington DCMember Posts: 56

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Answers

  • pdexheimerpdexheimer Member, Dev Posts: 534 ✭✭✭✭

    I'm not speaking from a great deal of experience here (this isn't my typical workflow), but it sounds like you want to use GENOTYPE_GIVEN_ALLELES (the -gt_mode and -alleles arguments).

    I can't offer a good explanation for the behavior of EMIT_ALL_SITES, except that it's very difficult to emit all potential indels - so it may just skip them entirely (or only calculate a fixed subset, or something like that)

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    I agree that it sounds like you want to use GENOTYPE_GIVEN_ALLELES rather than EMIT_ALL_SITES, if the point is to get calls at the specific sites in a VCF.

    Note that starting with version 2.7, EMIT_ALL_SITES will no longer be available for HaplotypeCaller. Instead, you'll want to use the -ERC argument, which outputs sites called as reference along with a confidence score that they are actually reference.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCMember Posts: 56

    So, I tried changing around the variables, but I still cannot get HC to output all sites. My original interval file was a vcf with 83 positions. With the following arguments I get different outputs, but none contains all 83 positions.

    -L sites.vcf -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_notpadded.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 46 positions

    -L sites_paddedby55bases.bed -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 79 positions (not all overlap with the original vcf though)

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0: 26 positions

    -L sites.vcf -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 59 positions

    -L sites_notpadded.bed -alleles sites.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 66 positions

    -L sites_paddedby55bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    -L sites_paddedby110bases.bed -alleles file.vcf -gt_mode GENOTYPE_GIVEN_ALLELES -stand_emit_conf 0 -out_mode EMIT_ALL_SITES: 80 positions

    I briefly looked at the bam files and there are many reads which overlap the uncalled positions.

    I am using GATK v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31

  • eflynn90eflynn90 Washington DCMember Posts: 56

    Ah, okay, I see one problem. For the last two runs (with padded bed files, gt given alleles, and emit all sites), the three SNPs which were not called were labeld "LowQual" in the -alleles vcf. So that makes sense if HC is set up not to genotype places that are labeled as "LowQual" in the FILTER column.

  • eflynn90eflynn90 Washington DCMember Posts: 56

    However, I still don't understand why all sites were not output when the intervals file was a vcf/the unpadded bed file. It was not a question of different length ref alleles, because most of the sites that weren't output were SNPs.

  • eflynn90eflynn90 Washington DCMember Posts: 56

    Either way I'll be able to run HC to genotype at those locations now, so thanks for your help!

  • eflynn90eflynn90 Washington DCMember Posts: 56

    I have one more question about using GENOTYPE_GIVEN_ALLELES. In the documentation, you warn about increasing the max alleles argument because it will drastically slow down HC. Is this true even then you are using the given alleles argument?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    It depends how you are using GGA mode. If you are passing in the vcf of alleles as intervals list as well (using -L) then the max alleles will be irrelevant because the caller will only genotype those sites with the given alleles. If you don't pass in the sites as intervals, the caller will also attempt to call all other sites in the genome normally (in addition to doing the given sites in GGA mode), and for that the max alleles are still relevant.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCMember Posts: 56

    So, if GGA mode is meant to function with the same vcf as the intervals and alleles file, why am I not getting all sites output when I use the vcf as the intervals file?

    Out of 80 valid positions in the vcf file, I got 59 positions with the vcf as the intervals file, and I got 80 positions with the padded bed file as the intervals file. (All other settings were the same and the vcf was the alleles file in both runs.) The 80 positions with the padded bed file were all variants that were in the original vcf file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    Hmm. Can you post an example of what variants were left out when using the vcf file as intevals?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCMember Posts: 56
    edited August 2013

    For example, this was in the alleles/intervals file:

    1       14773   .       C       T       720.92  .       .  
    1       14930   rs75454623      A       G       34793.34        .       .  
    1       14933   rs199856693     G       A       685.76  PASS    .  
    1       14948   rs201855936     G       A       590.15  .       .
    1       14976   rs71252251      G       A       663.76  .       .
    1       15029   rs201045431     G       A       269.33  .       .
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCMember Posts: 56
    edited August 2013

    And this was in the output vcf:

    1       14773   .       C       T       631.54  .       AC=3;AF=0.250;AN=12;BaseQRankSum=-0.828;ClippingRankSum=-0.028;DP=1449;FS=12.504;MLEAC=3;MLEAF=0.250;MQ=39.59;MQ0=0;MQRankSum=4.766;QD=0.85;ReadPosRankSum=-0.150       GT:AD:DP:GQ:PL  0/1:204,43:247:99:174,0,4860    0/0:270,0:270:99:0,808,7629     0/0:133,0:133:99:0,399,3600......
    
    1       15029   rs201045431     G       A       0       LowQual AC=0;AF=0.00;AN=12;DB;DP=946;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.30;MQ0=0 GT:AD:DP:GQ:PL  0/0:131,0:131:99:0,385,3746     0/0:213,0:213:99:0,633,6199     0/0:82,0:82:99:0,241,2293       0/0:167,0:167:99:0,494,4487     0/0:135,0:135:99:0,395,3647     0/0:218,0:218:99:0,654,6411.........
    
    Post edited by Geraldine_VdAuwera on
  • eflynn90eflynn90 Washington DCMember Posts: 56

    Sorry for the spacing. But basically of those six alleles, only two were output.

  • eflynn90eflynn90 Washington DCMember Posts: 56

    This is the output in using the padded bed file as the intervals file:

    1 14773 . C T 644.54 . AC=3;AF=0.250;AN=12;BaseQRankSum=-1.005;ClippingRankSum=0.337;DP=1419;FS=14.707;MLEAC=3;MLEAF=0.250;MQ=39.50;MQ0=0;MQRankSum=3.588;QD=0.90;ReadPosRankSum=-0.993 GT:AD:DP:GQ:PL 0/1:189,45:234:99:313,0,4469 0/0:270,0:270:99:0,808,7613 0/0:136,0:136:99:0,408,3695 0/1:

    1 14930 rs75454623 A G 30412.34 . AC=6;AF=0.500;AN=12;BaseQRankSum=-1.491;ClippingRankSum=-2.096;DB;DP=1989;FS=37.780;MLEAC=6;MLEAF=0.500;MQ=42.43;MQ0=0;MQRankSum=-4.878;QD=15.29;ReadPosRankSum=2.487 GT:AD:DP:GQ:PL 0/1:241,59:300:99:1287,0,5974 0/1:167,215:382:99:8066,0,3769 0/1:154,55:209:99:13

    1 14933 rs199856693 G A 1146.76 . AC=2;AF=0.167;AN=12;BaseQRankSum=-4.169;ClippingRankSum=-3.201;DB;DP=1992;FS=14.145;MLEAC=2;MLEAF=0.167;MQ=42.68;MQ0=0;MQRankSum=-6.500;QD=1.42;ReadPosRankSum=0.657 GT:AD:DP:GQ:PL 0/0:299,0:299:99:0,888,8528 0/0:382,0:382:99:0,1145,11499 0/0:208,0:208:99:0,620,6005

    1 14948 rs201855936 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=0.615;ClippingRankSum=1.003;DB;DP=1747;FS=3.429;MLEAC=0;MLEAF=0.00;MQ=40.15;MQ0=0;MQRankSum=0.321;ReadPosRankSum=-1.497 GT:AD:DP:GQ:PL 0/0:262,0:262:99:0,833,10392 0/0:332,0:332:99:0,1244,22062 0/0:182,0:182:99:0,595,8088 0/0:333,0:33

    1 14976 rs71252251 G A 0 LowQual AC=0;AF=0.00;AN=12;BaseQRankSum=-1.070;ClippingRankSum=-4.000;DB;DP=1362;FS=20.859;MLEAC=0;MLEAF=0.00;MQ=37.27;MQ0=0;MQRankSum=-4.166;ReadPosRankSum=3.406 GT:AD:DP:GQ:PL 0/0:212,1:213:99:0,716,7259 0/0:265,0:265:99:0,1234,13930 0/0:136,0:136:99:0,501,5256 0/0:

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    I'm not sure what's happening here. I'll ask @ebanks to have a look.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCMember Posts: 56

    Thanks.

  • KatherineSKatherineS Member Posts: 9

    Hi, I'm having a similar problem running HaplotypeCaller v2.7-2-g6bda569 in GGA mode. The input VCF (-L and -alleles argument) file contains 2,186 variants, but the output VCF only contains 1,874. When I genotype with UnifiedGenotyper v2.7-2-g6bda569, the output file contains all 2,186 variants in the input file. I've listed the command lines used below.

    More details:

    • The output VCF does contain some variants where both of the input BAM files are genotyped as reference 0/0, and where both are called as missing ./., so these categories aren't being systematically excluded.
    • At least some of the variants not being output by HaplotypeCaller have coverage
    • There are no variants marked as LowQual in the input VCF.
    • The input file is single-sample and all variants in it have genotype 0/1.

    Any ideas on how to get HaplotypeCaller to genotype all 2,186 variants would be much appreciated.

    Thanks,

    Katherine

    HaplotypeCaller

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output1.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    UnifiedGenotyper

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R ${reference} \
    -I sample1.bam \
    -I sample2.bam \
    -L $vcfIn \
    -alleles $vcfIn \
    -o output2.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -out_mode EMIT_ALL_SITES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    To find out what's going on here, I suggest you identify a couple of the variants that are disappearing when you run HC in GGA mode, then run again and use the option to generate an output BAM of what the HC reassembly step produced. Have a look at them in IGV and contrast vs. the original assembly and the original vs new calls. If the reassembly is radically changing the picture of what variation is present it might affect how the variants are output.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Member Posts: 9

    Hi Geraldine,

    I looked at the first ten variants in the input VCF, of which only four were genotyped. I extracted these ten variants to a new VCF and then tried to genotype them using --bamOutput out.bam and --bamWriterType CALLED_HAPLOTYPES. Five of the six variants that are not genotyped have no coverage in the HC output BAM - these are all located within ~ 20 bp. The sixth variant that is not genotyped has a few reads of coverage. All have coverage in the original BAM file.

    I tried a simplified experiment where I called variants for a single sample using HaplotypeCaller (restricting analysis to a single chromosome). This yielded 1290 variants (it's a low coverage sample). I then attempted to genotype the same sample at the locations of the variants just detected from that sample. Only 1075 variants were output by the genotyping step. Commands used:

    Variant calling

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L chr21 \
    -o NA18507.var.hc.chr21.vcf \
    -stand_call_conf 20 \
    -stand_emit_conf 20

    Genotyping

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${reference} \
    -I NA18507.bam \
    -L NA18507.var.hc.chr21.vcf \
    -alleles NA18507.var.hc.chr21.vcf \
    -o NA18507.geno.hc.chr21.vcf \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    -stand_call_conf 0.0 \
    -stand_emit_conf 0.0

    I would have expected all variant lines in the VCF file output by variant calling to be present in the VCF file output by genotyping. Am I missing something? Thanks for your thoughts!

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    Hi @KatherineS,

    The issue here might be that HaplotypeCaller does not do well on single-base intervals. This is something I had not run into before, so my apologies for the lack of documentation, but for running HC in GGA mode, you actually need to pad the intervals with e.g. 150 bp on either side. You can do this automatically with the --interval_padding argument. Please try this and let me know if it solves your problem.

    Geraldine Van der Auwera, PhD

  • KatherineSKatherineS Member Posts: 9

    Hi @Geraldine_VdAuwera,

    That definitely helps! Adding -ip 150 to the genotyping step results in 1290/1290 variant lines being output for the simplified experiment. For my application of interest, it results in 2179/2186 variant lines being output. -ip 200 yields 2185/2186. The remaining variant, rs12012803, isn't genotyped even at -ip 20000.

    Thank you so much for your help - I'm hugely grateful.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    Glad to hear it :)

    I'll document this as a gotcha for running HC in GGA mode.

    Geraldine Van der Auwera, PhD

  • YingYing Member Posts: 1

    Hi @Geraldine_VdAuwera‌ and @eflynn90‌ I am just wondering why HaplotypeCaller picks up the site where every sample has the 0/0 genotype? See the last line of VCF file posted by @eflynn90‌ It seems that the site is not a variant across all the samples. Any explanations, @Geraldine_VdAuwera‌ ?

    @Geraldine_VdAuwera said:
    I'm not sure what's happening here. I'll ask ebanks to have a look.
    @eflynn90 said:
    This is the output in using the padded bed file as the intervals file:

    ...

    1 15029 rs201045431 G A 0 LowQual AC=0;AF=0.00;AN=12;DB;DP=931;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=31.19;MQ0=0 GT:AD:DP:GQ:PL 0/0:131,0:131:99:0,385,3746 0/0:203,0:203:99:0,603,5922 0/0:83,0:83:99:0,244,2319 0/0:161,0:161:99:0,542,4982 0/0:133,0:133:99:0,389,3593 0/0:219,0:219:99:0,736,7158

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,305 admin

    Hi @Ying,

    The line posted here is incomplete, maybe there are samples that are not shown that have the variant. Are you seeing something strange in your own data?

    Geraldine Van der Auwera, PhD

  • ekanterakisekanterakis UKMember Posts: 18

    Hi Geraldine,
    I'm using GATK (3.2-2) HaplotypeCaller and wish to emit all sites (i.e. variants and reference) to my vcf genome-wide. You suggested using the -ERC argument which takes in either BP_RESOLUTION or GVCF. What are the differences between those two options and which one gives me a confidence score for the reference positions? Has anyone tried this option genome-wide?
    Thanks,
    Stathis

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @ekanterakis‌

    Hello Stathis,

    BP_RESOLUTION will give you every single site (variant and non-variant) as a record, and it will contain the GQ scores.

    GVCF will give you all the sites, but the non-variant sites will be grouped as blocks. The blocks state a start and end position, and they are grouped based on GQ. However, the GQ is not stated in the GVCF. If you run GenotypeGVCFs on the GVCF, you will get the GQ.

    Because BP_RESOLUTION outputs a record for every single site, it does take a lot of time and space. GVCF mode was developed to take less time and space and is more frequently used in genome-wide analysis.

    Please have a look at this article as well: https://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila

  • ekanterakisekanterakis UKMember Posts: 18

    Thanks Sheila. I have a follow-up question. Is using BP_RESOLUTION/GVCF going to affect Variant Recalibration downstream? Should I split my variant and non-variant sites and combine them later?
    -Stathis

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @ekanterakis‌

    Hi Stathis,

    Because variant recalibration is run on SNPs separately and indels separately, you will use SelectVariants to create new vcfs of only SNPs and only indels. You will then run variant recalibration on each of those vcfs. The non-variant sites will not be included in either of the new vcfs, so you do not need to worry about them.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @ekanterakis‌

    Hi again Stathis,

    Please ignore my above comment if you are going to run VQSR. I was thinking about hard filtering when I wrote that.

    Because the GVCF is just an intermediate, you will run GenotypeGVCFs on all of your GVCFs, and get a file without non-variants. If you would like to retain the non-variants, you can use the
    --includeNonVariantSites argument. If you do use the --includeNonVariantSites argument, it doesn't matter because VQSR will just ignore non-variants, and they will be carried on to the output without modification.

    For VQSR, you do not need to run SelectVariants. You can just run all the steps serially on the same original input file, ending up with everything properly recalibrated in one output file (after the half-recalibrated intermediate). Please read more about VQSR workflow here: https://www.broadinstitute.org/gatk/guide/article?id=2805

    -Sheila

  • ekanterakisekanterakis UKMember Posts: 18

    Hi Sheila, I'm running HaplotypeCaller by chromosome and chunking (50Mb at a time). I'm using -ERC BP_RESOLUTION to get every single position. I would then like to run VariantRecalibrator -mode BOTH (can this be done on each of the chunks separately??) and then ApplyRecalibration (this needs to be done separately for SNPs and Indels, right?). Is this a good approach? Where will my reference positions go after ApplyRecalibration? Will they be in the SNPs file? If I then CombineVariants for all chunks of all chromosomes do I get a complete, recalibrated genome vcf?
    Thank you again for your help.
    -Stathis

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @ekanterakis‌

    Hi Stathis,

    You cannot use -BOTH for VQSR at any point, nor can you chunk VQSR jobs.

    You are correct about the ApplyRecalibration step being done separately on SNPs and indels. Your reference positions will be in the final vcf file, but they will not be recalibrated n the process.

    -Sheila

  • FerFer AustriaMember Posts: 29

    Hi, this is my first post in the blog, and it comes only after trying my best to look for a solution to my problem.
    I'm also using HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode, and I want it to EMIT_ALL_SITES that I've passed it cause it is important for downstream scripts (before used to run UG and got what I wanted, but now I want the benefits of HC). Even after playing around with --interval_padding parameter I still don't get all sites.
    For couple sites that were not emitted I've checked the bam files, and [no surprise] there were no reads spanning the region (I'm working with very low coverage ~1x data for hundreds of individuals).
    Last thing I've tried was to set to 0 -minReadsPerAlignStart just in case, but it was also unsuccessful. What can I do to get ALL passed sites even if they come as ./. as many others already do?
    Alberto

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @Fer‌

    Hi Alberto,

    Can you please provide us with what version of GATK you are using and the exact command line you used? Also, can you post some example lines from the VCF showing missing positions?

    Thanks,
    Sheila

  • FerFer AustriaMember Posts: 29
    edited November 2014

    Thanks Sheila,
    I've tried with both versions 3.2 and 3.3.

    java -Xmx6g -jar GenomeAnalysisTK.jar /
    -R $ref /
    -T HaplotypeCaller /
    -I $acc.sort.mkdup.realigned.bam /
    -o $acc.hc.genotyped.vcf /
    --genotyping_mode GENOTYPE_GIVEN_ALLELES /
    -out_mode EMIT_ALL_SITES /
    -stand_call_conf 0.0 /
    -stand_emit_conf 0.0 /
    --interval_padding 200 /
    -minReadsPerAlignStart 0 /
    --alleles %s.%s.hc.snp.vcf /
    -L %s.%s.hc.snp.vcf

    I don't know if this is the example you're asking me for. It comes from the %s.%s.hc.snp.vcf file that provides the alleles of interest. Positions 1420 and 3102 get into the output file, while 2309 and 2310 don't.

    Chr1 1420 . T A 39.17 . AC=2;AF=0.500;AN=4;DP=11;FS=0.000;GQ_MEAN=31.69;GQ_STDDEV=22.03;InbreedingCoeff=0.8131;MQ=60.00;MQ0=0;NCC=0;QD=13.06 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,232 1/1:0,3:3:9:82,9,0

    Chr1 2309 . T A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=-1.628e+00;ClippingRankSum=0.158;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=1.63;NCC=1;QD=26.88;ReadPosRankSum=1.31 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 2310 . C A 4209.40 . AC=4;AF=1.00;AN=4;BaseQRankSum=1.10;ClippingRankSum=0.683;DP=14;FS=3.100;GQ_MEAN=34.00;GQ_STDDEV=13.68;InbreedingCoeff=0.9924;MQ=60.00;MQ0=0;MQRankSum=-1.580e-01;NCC=1;QD=28.92;ReadPosRankSum=1.42 GT:AD:DP:GQ:PL 1/1:0,8:8:24:359,24,0 1/1:0,6:6:18:270,18,0

    Chr1 3102 . A G 2245.11 . AC=4;AF=1.00;AN=4;DP=13;FS=0.000;GQ_MEAN=29.69;GQ_STDDEV=17.70;InbreedingCoeff=0.9221;MQ=60.00;MQ0=0;NCC=0;QD=27.05 GT:AD:DP:GQ:PL 1/1:0,11:11:33:320,33,0 1/1:0,2:2:6:57,6,0

    Post edited by Fer on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin
    edited November 2014

    @Fer‌

    Hi Alberto,

    The issue is that Haplotype Caller is not compatible with -out_mode EMIT_ALL_SITES. You must use -ERC BP_RESOLUTION instead. This will give you an intermediate GVCF which you will need to run in GenotypeGVCFs to give you the final vcf. Sorry for the confusion. I will fix this in the documentation asap.

    Please read more about -ERC BP_RESOLUTION and intermediate GVCFs here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--emitRefConfidence

    https://www.broadinstitute.org/gatk/guide/article?id=4017

    Once you have the output GVCF from running HaplotypeCaller with -ERC BP_RESOLUTION, to get the regular vcf with all sites, you will run GenotypeGVCFs with -allSites. Please read about it here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php#--includeNonVariantSites

    I hope this helps.

    -Sheila

    Post edited by Sheila on
  • FerFer AustriaMember Posts: 29

    Thanks, it helped to know. However, I was able to find my way around to get what I originally wanted: the output of GENOTYPE_GIVEN_ALLELES emitting all sites present in the vcf provided in --alleles parameter. I just had to run Combine variants afterwards.
    Alberto

  • tinutinu Member Posts: 48

    I am having issue GATK UnifiedGenotyper in GENOTYPE GIVEN ALLELES mode

     java -Xmx4g -jar GATK-3.3/GenomeAnalysisTK.jar -T UnifiedGenotyper \
     -R  hs37d5_chr.fa -glm BOTH  
    --genotyping_mode GENOTYPE_GIVEN_ALLELES  \
    -alleles:VCF diff-chr21_sort.vcf-L diff-chr21_sort.vcf \
    --output_mode EMIT_ALL_SITES -U LENIENT_VCF_PROCESSING \
    -I sample1.bam -o diff-chr21_sort.gga.vcf
    

    My input file (-alleles) file has 3672 variants.
    But GGA called VCF just had 2583 variants.

    Took the variants not in the outputfile (3672 - 2583) and ran UnifiedGenotyper in GGA mode and saw that none of those variants were outputted and these were the GATK messages

     16:18:53,014 HelpFormatter - Date/Time: 2015/05/08 16:18:52
    INFO  16:18:53,015 HelpFormatter - --------------------------------------------------------------------------------
    INFO  16:18:53,016 HelpFormatter - --------------------------------------------------------------------------------
    INFO  16:18:53,428 GenomeAnalysisEngine - Strictness is SILENT
    INFO  16:18:53,651 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
    INFO  16:18:53,664 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO  16:18:53,732 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
    INFO  16:18:54,356 IntervalUtils - Processing 1494 bp from intervals
    INFO  16:18:54,454 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO  16:18:54,603 GenomeAnalysisEngine - Done preparing for traversal
    INFO  16:18:54,604 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO  16:18:54,604 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
    INFO  16:18:54,605 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
    INFO  16:18:54,653 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO  16:18:54,653 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO  16:18:57,585 ProgressMeter -            done     74850.0     2.0 s      39.0 s       99.9%     2.0 s       0.0 s
    INFO  16:18:57,585 ProgressMeter - Total runtime 2.98 secs, 0.05 min, 0.00 hours
    INFO  16:18:57,590 MicroScheduler - 109 reads were filtered out during the traversal out of approximately 4554 total reads (2.39%)
    INFO  16:18:57,590 MicroScheduler -   -> 93 reads (2.04% of total) failing BadMateFilter
    INFO  16:18:57,591 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO  16:18:57,591 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO  16:18:57,592 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO  16:18:57,592 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO  16:18:57,593 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO  16:18:57,593 MicroScheduler -   -> 16 reads (0.35% of total) failing UnmappedReadFilter
    WARN  16:18:59,117 RestStorageService - Error Response: PUT '/qheODGuSqbELnioHLUk9v353xAJ1rwcd.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 397, Content-MD5: 2p38VGbP3uPfO8rDLw39kA==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: da9dfc5466cfdee3df3bcac32f0dfd90, Date: Fri, 08 May 2015 21:18:58 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:+qZ2PaFrK/RE6siH04K2xWRV2fs=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-431.5.1.el6.x86_64; amd64; en; JVM 1.7.0_51), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: A5F6D5659C08C440, x-amz-id-2: mBCoflGPYIDx+QRSvK4rwnn84Rp5zx3B7sxooBsNqtr4j72R+HbJTAZlyvFT5DkU7EiAHbrqrFo=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Fri, 08 May 2015 22:07:17 GMT, Connection: close, Server: AmazonS3]
    WARN  16:18:59,179 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 2897 seconds. Retrying connection.
    INFO  16:18:59,389 GATKRunReport - Uploaded run statistics report to AWS S3
    

    I had previously used Unified Genotyper and it did give me all variants in the output file. Not sure what is happening now.

  • tommycarstensentommycarstensen United KingdomMember Posts: 400 ✭✭✭

    @tinu It should not matter, but can you try with -stand_call_conf 0 and -stand_emit_conf 0? Any reason you are using -U LENIENT_VCF_PROCESSING? Can you print an example of a variant, which fails to be called?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @tinu
    Hi,

    This thread may help: http://gatkforums.broadinstitute.org/discussion/5520/is-there-a-conflict-between-emit-all-confident-sites-and-genotype-given-alleles#latest

    Also, Tommy @tommycarstensen is correct about the vcf maybe having errors. Please try validating your vcf.

    Thanks,
    Sheila

  • tinutinu Member Posts: 48

    @tommycarstensen I tried using both-stand_call_conf 0 and -stand_emit_conf 0. Still the same issue, not all variants given as -alleles is there in the GGA output.

    I am pasting few variants which were not present in the output, but were part of the input file

    chr21   10906855        .       G       A       167.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.271e+00;CCC=75212;ClippingRankSum=2.20;DP=1989849;FS=0.000;GQ_MEAN=89.17;GQ_STDDEV=27.18;HWP=1.0000;InbreedingCoeff=0.0004;MLEAC=1;MLEAF=1.330e-05;MQ=53.65;MQ0=0;MQRankSum=-3.030e+00;NCC=1;NEGATIVE_TRAIN_SITE;QD=3.21;ReadPosRankSum=-2.096e+00;VQSLOD=-6.788e+00;culprit=QD;set=FilteredInAll
    chr21   10906856        .       G       T       73.02   VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.699e+00;CCC=75212;ClippingRankSum=-5.190e-01;DP=1989828;FS=2.374;GQ_MEAN=89.22;GQ_STDDEV=27.09;HWP=1.0000;InbreedingCoeff=0.0008;MLEAC=1;MLEAF=1.330e-05;MQ=58.36;MQ0=0;MQRankSum=0.411;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.12;ReadPosRankSum=-3.040e-01;VQSLOD=-2.658e+00;culprit=QD;set=FilteredInAll
    chr21   10906858        .       T       G,C     14393.03        VQSRTrancheSNP99.60to99.80      AC=164,1;AF=2.181e-03,1.330e-05;AN=75212;BaseQRankSum=-5.703e+00;CCC=75212;ClippingRankSum=2.51;DP=1952326;FS=17.557;GQ_MEAN=85.70;GQ_STDDEV=30.69;HWP=1.0000;InbreedingCoeff=0.0003;MLEAC=144,1;MLEAF=1.915e-03,1.330e-05;MQ=54.45;MQ0=0;MQRankSum=2.53;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.11;ReadPosRankSum=1.19;VQSLOD=-6.531e+00;culprit=FS;set=FilteredInAll
    chr21   10906864        .       A       C       361.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=0.825;CCC=75212;ClippingRankSum=1.22;DP=1945929;FS=3.564;GQ_MEAN=88.71;GQ_STDDEV=26.94;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1;MLEAF=1.330e-05;MQ=57.04;MQ0=0;MQRankSum=0.332;NCC=1;NEGATIVE_TRAIN_SITE;QD=2.54;ReadPosRankSum=0.086;VQSLOD=-1.969e+00;culprit=QD;set=FilteredInAll
    chr21   10906865        .       G       C       160.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-1.920e+00;CCC=75212;ClippingRankSum=-1.220e-01;DP=1946148;FS=0.000;GQ_MEAN=88.69;GQ_STDDEV=26.91;HWP=1.0000;InbreedingCoeff=0.0007;MLEAC=1;MLEAF=1.330e-05;MQ=58.27;MQ0=0;MQRankSum=-2.420e-01;NCC=1;NEGATIVE_TRAIN_SITE;QD=0.96;ReadPosRankSum=-5.970e-01;VQSLOD=-3.119e+00;culprit=QD;set=FilteredInAll
    chr21   10906866        .       A       T       306.28  VQSRTrancheSNP99.60to99.80      AC=2;AF=2.659e-05;AN=75212;BaseQRankSum=-5.560e-01;CCC=75212;ClippingRankSum=0.169;DP=1946211;FS=0.000;GQ_MEAN=88.73;GQ_STDDEV=26.87;HWP=1.0000;InbreedingCoeff=0.0008;MLEAC=2;MLEAF=2.659e-05;MQ=60.00;MQ0=0;MQRankSum=2.40;NCC=1;NEGATIVE_TRAIN_SITE;QD=1.76;ReadPosRankSum=0.960;VQSLOD=-2.343e+00;culprit=QD;set=FilteredInAll
    chr21   10906868        .       CATA    C,TATA  233.71  VQSRTrancheINDEL99.00to99.50    AC=1,1;AF=1.330e-05,1.330e-05;AN=75212;BaseQRankSum=-1.123e+00;CCC=75212;ClippingRankSum=-1.058e+00;DP=1946484;FS=1.668;GQ_MEAN=88.72;GQ_STDDEV=26.87;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1,1;MLEAF=1.330e-05,1.330e-05;MQ=56.33;MQ0=0;MQRankSum=0.770;NCC=1;QD=2.57;ReadPosRankSum=0.963;VQSLOD=-5.935e-01;culprit=QD;set=FilteredInAll
    chr21   10906870        .       T       C       667.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=1.63;CCC=75212;ClippingRankSum=-5.080e-01;DP=1946781;FS=0.000;GQ_MEAN=88.76;GQ_STDDEV=27.02;HWP=1.0000;InbreedingCoeff=0.0009;MLEAC=1;MLEAF=1.330e-05;MQ=45.95;MQ0=0;MQRankSum=0.322;NCC=1;NEGATIVE_TRAIN_SITE;QD=5.02;ReadPosRankSum=1.19;VQSLOD=-1.779e+00;culprit=QD;set=FilteredInAll
    chr21   10906872        .       A       G       531.02  VQSRTrancheSNP99.60to99.80      AC=1;AF=1.330e-05;AN=75212;BaseQRankSum=-3.022e+00;CCC=75212;ClippingRankSum=-1.850e-01;DP=1947078;FS=0.000;GQ_MEAN=88.77;GQ_STDDEV=26.92;HWP=1.0000;InbreedingCoeff=0.0007;MLEAC=1;MLEAF=1.330e-05;MQ=59.56;MQ0=0;MQRankSum=-1.247e+00;NCC=1;NEGATIVE_TRAIN_SITE;QD=2.75;ReadPosRankSum=2.07;VQSLOD=-3.958e+00;culprit=QD;set=FilteredInAll
    chr21   10906875        .       CT      AT,TT,GT,C      3201.09 VQSRTrancheINDEL99.00to99.50    AC=10,1,1,1;AF=1.330e-04,1.330e-05,1.330e-05,1.330e-05;AN=75212;BaseQRankSum=-1.916e+00;CCC=75212;ClippingRankSum=-6.650e-01;DP=1947549;FS=0.880;GQ_MEAN=88.83;GQ_STDDEV=27.25;HWP=1.0000;InbreedingCoeff=0.0003;MLEAC=10,1,1,1;MLEAF=1.330e-04,1.330e-05,1.330e-05,1.330e-05;MQ=58.37;MQ0=0;MQRankSum=0.371;NCC=1;QD=2.83;ReadPosRankSum=1.06;VQSLOD=-6.668e-01;culprit=InbreedingCoeff;set=FilteredInAll
    
  • pdexheimerpdexheimer Member, Dev Posts: 534 ✭✭✭✭

    Aha! They're all filtered, so the engine never sees them to genotype those alleles. I don't believe there's an option to ignore filters on -alleles (but check, there might be!), but there is an option to strip filters in SelectVariants - so you should be able to remove the filters first, and then send it UG

  • tinutinu Member Posts: 48
    edited May 2015

    @pdexheimer I tried to empty the INFO field by putting the value of INFO field as '.' and still those variants were not in the output.

    chr21   10906855        .       G       A       167.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906856        .       G       T       73.02   VQSRTrancheSNP99.60to99.80      .
    chr21   10906858        .       T       G,C     14393.03        VQSRTrancheSNP99.60to99.80      .
    chr21   10906864        .       A       C       361.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906865        .       G       C       160.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906866        .       A       T       306.28  VQSRTrancheSNP99.60to99.80      .
    chr21   10906868        .       CATA    C,TATA  233.71  VQSRTrancheINDEL99.00to99.50    .
    chr21   10906870        .       T       C       667.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906872        .       A       G       531.02  VQSRTrancheSNP99.60to99.80      .
    chr21   10906875        .       CT      AT,TT,GT,C      3201.09 VQSRTrancheINDEL99.00to99.50    .
    
    Post edited by tinu on
  • pdexheimerpdexheimer Member, Dev Posts: 534 ✭✭✭✭

    Well, that makes sense. The INFO field has nothing to do with filters. Try clearing out the FILTER column

  • tinutinu Member Posts: 48

    @Sheila According to your thread, if I give just Genotype_Given_Alleles, I just get 254 variants input and
    with EMIT_ALL_CONFIDENT_SITES the output has 2583 variants
    with EMIT_ALl_SITES, the ouput has 2533 variants

    My input has 3672 variants

  • tinutinu Member Posts: 48

    @pdexheimer OK so now I made the FILTER and INFO field as '.' and tried running again and now my input and output have the same number of variants.

    Thanks for helping out.

    Could you tell why does the FILTER field values affect the GENOTYPE_GIVEN_ALLELE mode of variant calling

  • pdexheimerpdexheimer Member, Dev Posts: 534 ✭✭✭✭

    By default, the engine ignores filtered variants in vcfs. After all, if they were confident sites, they wouldn't be filtered! Most of the walkers that take vcfs as input have an option to ignore the filters, but I don't think such an option exists for the interval processing unit

  • zzqzzq ChinaMember Posts: 54
    edited March 25

    Dear all, @Geraldine_VdAuwera @Sheila
    I am running HaplotypeCaller to call genotype for each confident site including reference homozygous using following command,

    java -Xmx100g  -jar GenomeAnalysisTK-3.5.jar \
            -R reference.fa \
            -T HaplotypeCaller -nct 8 \
            -stand_call_conf 30 \
            -stand_emit_conf 30 \
            --output_mode EMIT_ALL_CONFIDENT_SITES \
            -I sample1.bam \
            -I sample2.bam \
            -I sample3.bam \
            -o output.vcf.gz

    But from the result, it just not give me an intensive site, so I want to know, why so many positions are not included in the result, do those sites have a messy mapping ? The size of chromosome 1 is 250M, and the number of confident sites including in the result is about 10M, How about the remaining sites which have 240M ?

    Many thanks!

    Post edited by zzq on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin
    edited March 25

    @zzq
    Hi,

    I think the issue is that --output_mode EMIT_ALL_CONFIDENT_SITES does not work in HaplotypeCaller. Your best bet is to run HaplotypeCaller wirh -ERC BP_RESOLUTION on each of the samples, then GenotypeGVCFs with -allSites on those output GVCFs.

    -Sheila

    P.S. To save space, you can run with -ERC GVCF instead of -ERC BP_RESOLUTION.

    Post edited by Sheila on
  • michelmichel Member Posts: 13

    Dear GATK_developers,

    Could you please update HaplotypeCallers docs and remove the --output_mode options EMIT_ALL_SITES and EMIT_ALL_CONFIDENT_SITES as they are no longer available for this tool?
    Maybe even but some hints there that -ERC BP_RESOLUTION could be used instead would be great!

    It would have saved me some time and i guess others might run into the same trap.
    Thank you,
    Michel

    Issue · Github
    by Sheila

    Issue Number
    1115
    State
    open
    Last Updated
    Assignee
    Array
    Milestone
    Array
  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @michel
    Hi Michel,

    Yes, sorry for the confusion. I will make a note to update that.

    -Sheila

  • LisamLisam FRANCEMember Posts: 3

    Hi,

    As I had lost a good number of variants with HaplotypeCaller, I followed your advice and added the -ERC BP_RESOLUTION argument to get every single site. However, I am now struck with a problem and hope you can help me resolve it. For the position of chr10:102762256 (rs752974; which I could not detect without -ERC BP_RESOLUTION), you can find in the vcf file :

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT M977
    chr10 102762256 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:0,19:19:0:0,0,0

    I don't understand why there is no value for QUAL and why the genotype is 0/0 when there is a coverage of 19 reads all with a non-ref base. I checked the base qualities of the non-reference bases on the input BAM with IGV and they range from 18 to 30. Are those values too low for the bases to be considered in the calculation of the genotype? What's more, the mapping qualities of the reads on IGV for the bam input file were 50, and while it does look good, I wasn't able to find anywhere if it was actually also too low to be taken into account and if the reads were discarded because of it.

    You can see the screenshot of the input BAM on IGV : image

    Another thing is, when I add the -minDanglingBranchLength 1, I obtain the genotype 1/1 but don't understand some results :

    chr10 102762256 . C T,<NON_REF> 476.77 . DP=19;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=47500.00 GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:505,57,0,505,57,505:0,0,18,1

    I get three values for AD (0,19,0) and can't understand why.

    Thanks in advance,

    Lisa

    igv_inputbam_chr10_102762256.png
    285 x 475 - 4K
  • shleeshlee CambridgeMember, Broadie, Moderator, Dev Posts: 137 admin
    edited July 27

    Hi @Lisam,

    Your raw BAM indicates that this is a border region where there are no reads on one side and then suddenly reads on the other. By default HaplotypeCaller needs some amount of reference bases at the 5' start of an assembly graph. I think this is why changing the -minDanglingBranthLength parameter is changing your results.

    The three values for AD refer to the allele depth for reference allele (C), first alt allele (T), and second alt allele <NON_REF>. So you have 19 reads that support the first alt allele (T).

    What will help in interpreting your calls is HaplotypeCaller's -bamout for the region +/- -minDanglingBranthLength 1 option. View your raw BAM side-by-side with the BAMOUTs. Does one of your bamouts show trimmed reads that exclude position 102762256?

    Post edited by shlee on
  • LisamLisam FRANCEMember Posts: 3

    Thanks for you answer @shlee and sorry for answering only today!

    I compared the bam input and bam output with the argument -minDanglingBranchLength 1 on IGV and got the following picture (the bam input is the one on top):

    image

    I also compared the bamout without the -minDanglingBranchLength 1 argument (the bam input is also the one on top) and no position is detected (I restricted the bamout to the region +/- 150 bp around the position so I would get it faster, but I may have gotten reads further on the chromosome if I had for instance asked for the bamout of the entire genome - I'll try that if you want). You can see it here :

    image

    I hope it can help!

    bam_input_chr10_102762256_and_output_without_minD.png
    247 x 602 - 3K
  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 3,579 admin

    @Lisam
    Hi,

    Can you post the entire region (+/- 150 bases) around the site of interest? It looks like something weird is going on. Notice the red reads which suggest potential structural variation.

    -Sheila

  • LisamLisam FRANCEMember Posts: 3

    Hi @Sheila,

    You can find here the igv screenshot of the bamout file (with the minDanglingBranchLength -1 argument, the bamout without the argument doesn't have any read in the region) around the region of chr10:102762256 (+/-150 bp) :

    image

    And here the screenshot of the bam input also around the region of chr10:102762256 (+/-150 bp) which is the beginning of an exon :

    image

    I also have an unrelated question. When I checked the reads in the bamout file (with the minDanglingBranchLength -1 argument )for chr10:102762256, those that were bearing the ALT base T had a "flag" HC:i:-344638776, and I am wondering what it means.

    Thanks in advance!

    baminput_chr10_102762256_M977_plus_minus_150_bp.png
    925 x 510 - 12K
  • shleeshlee CambridgeMember, Broadie, Moderator, Dev Posts: 137 admin

    @Lisam,

    One of the team will correct me if I am mistaken. In the meanwhile, here is my take on what is going on for your locus. I think what you want is an explanation of why you observe the no call versus call for the alternate allele at the 5' start of your locus.

    Ok, given you've observed the following:

    [T]he bamout without the [minDanglingBranchLength 1] doesn't have any read in the region) around the region....

    This means the locus didn't pass one of the early selection processes and so is not even considered by HaplotypeCaller. Hence the empty locus in the bamout. The assembly graph's pruning of dangling branches, which requires a default 4 kmers to attempt recovery, removes the branch containing your allele from the graph. This explains your no call for the region. Let's get deeper into this.

    First, take a look at this recent workshop presentation. In your case, slides 6–9 are of interest. We can answer any questions you have for the slides. I'm showing slide 8 below where you should see two dangling branches. One in the GaCTC box at the left and the second is the c-A-C-T-A branch at the bottom. Remember the single base representation in this graph is a simplification that only represents the last base for what are really kmers, in this case kmers of five bases. The reference is also a path through the graph (blue arrows). What HaplotypeCaller wants is a graph that starts with one kmer and ends in one kmer--in other words a graph where kmers emerge and converge back on to reference kmers. Ah, also, remember that HaplotypeCaller will use soft-clipped bases.

    image

    As the slide shows, these branches translate to candidate haplotypes. However, HaplotypeCaller subsequently prunes haplotypes with dangling branches with less than 4 kmers. HaplotypeCaller attempts rescue of branches with four or more kmers in the dangling branch. So we prune the left dangling branch immediately but not yet the bottom dangling branch, which goes on to being rescued. Your locus, at 3 bases from the start of your read coverage, is not rescued by the default 4 kmer setting. This locus then contains no alternate alleles and so the locus is no longer a contender for HaplotypeCaller. When you put in minDanglingBranchLength 1 or 2 you change the minimum kmers needed. This allows rescue attempts of your 5' dangling head. And this is why your allele is then called.

    Let's go on to your last question about how the reads are labeled. The BAMOUT contains your reads plus ARTIFICIAL HAPLOTYPES that represent every combination of alternate alleles using paths in the assembly graph. This means artifical haplotypes also represent chimeric haplotypes that may have absolutely no support from phasing of reads. This in turn means you should be certain NOT to include these artificial haplotypes in eyeballing/interpreting a locus using the BAMOUT. In IGV, if you group your reads by RG tag, you'll see two of the reads in one group, the artificial hapotypes, and the rest of your reads in another, the emitted reads, divided by a line that IGV draws across the track. Then you can color your reads by HC tag and see which reads support which of the artifical haplotypes. Your HC:i:-344638776 tag is labeling one such haplotype group. Sorting by HC tag will subgroup the reads by color.

    I hope all this is helpful. Let me know if we can clarify anything.

    Slide09.png
    1440 x 1080 - 219K
Sign In or Register to comment.