Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Two validated variants missed by HaplotypeCaller using MIP data (amplicon like data)

MaartjevdVorstMaartjevdVorst NijmegenMember
edited October 2016 in Ask the GATK team

Dear GATK,

We are using MIPs (amplicon like) data to analyze the variants for certain genes. However, in two independent samples two validated variants were missed by the HaplotypeCaller. We were wondering if you have any idea why these variants were not called?

I've used the latest version of GATK (3.6) and the two commands we performed are:
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample1.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample1_haplotypecaller.g.vcf
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample2.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample2_haplotypecaller.g.vcf

Attached you will find two pictures of the used bam files. The mapping quality of the variant-reads look similar compared to the reference-reads(~60) as well as the base phred quality (~36). I've tried also many other settings/arguments for example by lowering the minimum phred-scaled confidence threshold at which variants should be called and the minimum phred-scaled confidence threshold at which variants should be emitted. Nothing worked to call the variants, However, if I use a smaller target region I am able to call the variant located on chr8.

The output of the GVCF gave:
chr14 31355353 . C . . END=31355353 GT:DP:GQ:MIN_DP:PL 0/0:987:0:987:0,0,11170
and
chr8 117861187 . G . . END=117861187 GT:DP:GQ:MIN_DP:PL 0/0:1253:0:1253:0,0,20903

Thank you very much in advance!
Kind regards,

Maartje

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MaartjevdVorst
    Hi Maartje,

    Have a look at this thread.

    I hope it helps.

    -Sheila

  • MaartjevdVorstMaartjevdVorst NijmegenMember

    Hi Sheila,

    Thanks for your reply. I managed to call the variant on chromosome 8 by using --kmerSize 18. However, the variant located on chromosome 14 is still not called using different kmerSize values. Do you think that the reason can be that it is located on the end of the reads?

    Maartje

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MaartjevdVorst
    Hi Maartje,

    Have a look at this document for more tips.

    Also, can you please post an IGV screenshot of the original BAM file at the position (please include about 100 bases before and after the site of interest). Please also post the bamout file.

    Thanks,
    Sheila

  • MaartjevdVorstMaartjevdVorst NijmegenMember

    Hi Sheila,

    I had a look at your document for more tips, but still I cannot find a good reason why the variant is not called. Qualities are looking fine in the original bam file and I don't see multiple alleles. Also the variants isn't a systematic bias because it's validated with Sanger and I also tried different --kmerSizes. I was not able to attach the bamout.bam (no permission), so I made a IGV screenshot of it, see attachment: chr14_bamout.png. And I also attached an IGV screenshot of the original BAM file (chr14_originalbam.png). You will notice that the region where the variant is located doesn't contain any mapped reads anymore in de bamout. So this is probably the reason I think why it's not called, but why are all those reads filtered out?

    Thank you for all your help!
    Maartje

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @MaartjevdVorst
    Hi Maartje,

    The major issue is that the variant is at the ends of all the reads. HaplotypeCaller will not have much confidence in the variants that are only present at the ends of reads, as sequencers tend to make mistakes there. You can try the tips in section 4 of this article because the region right now is not getting called as an active region.

    Also, you can try running UnifiedGenotype on that region and see if a call is made there.

    -Sheila

  • MaartjevdVorstMaartjevdVorst NijmegenMember

    Thank you very much!

  • KleinertpKleinertp Member
    edited April 11
    Hey GATK-Staff,
    I have a similar question regarding amplicon like data generated by MIPs:
    I use GATK4.0.4.0 Haplotypecaller in GVCF mode.

    gatk --java-options "-Xmx8G" HaplotypeCaller --max-reads-per-alignment-start 1000 --kmer-size 10 --kmer-size 11 --kmer-size 12 --kmer-size 13 --kmer-size 14 --kmer-size 15 --kmer-size 25 --kmer-size 35 --max-num-haplotypes-in-population 512 --standard-min-confidence-threshold-for-calling 20 --sample-name {params.plate} -ERC GVCF -R {input.fasta} -I {input.bam} -L {input.targets} -O {output.vcfgz}

    with CombineGVCFs and GenotypeGVCFs
    I start missing variants when the --max-read-per-alignment-start is set higher than 50 that UnifiedGenotyper calls (see IGV shot). I already played around with kmer-size and max-num-haplotypes-in-population to recover some missing variants but not all.
    Do you have any Idea what happens here?

    Philip
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Kleinertp

    --max-reads-per-alignment-start is helpful when the genome has a few hotspots of extremely high coverage, due to mapping error for the most part, where to avoid spending an inordinate amount of compute on these few regions we truncated the coverage.

    However, this behavior should be turned off(by setting --max-reads-per-alignment-start to 0), for deep sequencing data as yours, when the coverage is uniformly high and you want to use that depth to discover low-AF variants.

  • KleinertpKleinertp Member
    edited April 25
    Hey,

    Actually particularly putting --max-reads-per-alignment-start to 0 makes me not call 20 % of my variants even though no other parameter is changed. --max-reads-per-alignment-start 50 is the most successful setting.
    Do you have any explanation for that?

    Philip
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Kleinertp

    Can you please post a few variant records that are filtered by setting -max-reads-per-alignment-start to 0 and retained when --max-reads-per-alignment-start is set to 50

  • KleinertpKleinertp Member
    These Variants are NOT called when --max-reads-per-alignment-start is 0 but are called when --max-reads-per-alignment-start is 50
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited May 6

    Hi @Kleinertp

    Actually particularly putting --max-reads-per-alignment-start to 0 makes me not call 20 % of my variants even though no other parameter is changed. --max-reads-per-alignment-start 50 is the most successful setting. Do you have any explanation for that?

    That happens because, when there are large number of reads and you don't down sample, there are many haplotypes created and it causes ambiguity hence lower number of variants are called.

    Also try to use default kmer sizes.

  • KleinertpKleinertp Member

    I have another question that I could not get solved:
    Comparing my variantcallset obtained with GATK3 I realized that GATK4 does not call some variants that are imbalanced. Meaning: AD values like this: 12,716

    I use GATK4.0.4.0 Haplotypecaller in GVCF mode.

    gatk --java-options "-Xmx8G" HaplotypeCaller --max-reads-per-alignment-start 1000 --kmer-size 10 --kmer-size 11 --kmer-size 12 --kmer-size 13 --kmer-size 14 --kmer-size 15 --kmer-size 25 --kmer-size 35 --max-num-haplotypes-in-population 512 --standard-min-confidence-threshold-for-calling 20 --sample-name {params.plate} -ERC GVCF -R {input.fasta} -I {input.bam} -L {input.targets} -O {output.vcfgz}

    with CombineGVCFs and GenotypeGVCFs

    Is there a way to NOT filter these "imbalanced" variants in GATK4? I couldn't find any parameters, but maybe I just missed it.

    Best,

    Philip

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Kleinertp

    Could you please post example vcf records that you do not want filtered to get a better sense of the question here.

  • KleinertpKleinertp Member
    edited August 5

    Here are 4 examples of variants that get filtered in GATK4 using above mentioned settings and are therefore NOT called.
    I do believe these variants are real though and should be called.
    Any idea why these variants are not being considered?

    fileformat=VCFv4.2

    source=CombineGVCFs
    source=GenotypeGVCFs
    source=HaplotypeCaller
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Plate
    X 1 . T C 48.20 . AC=1;AF=1.316e-03;AN=760;BaseQRankSum=-1.500e+00;DP=19798;ExcessHet=3.0103;FS=30.821;InbreedingCoeff=-0.0013;MLEAC=1;MLEAF=1.316e-03;MQ=60.00;MQRankSum=0.00;QD=0.60;ReadPosRankSum=-3.450e+00;SOR=4.163 GT:AD:DP:GQ:PL 0/1:73,7:80:74:74,0,4454
    X 2 . G A 101.24 . AC=1;AF=1.316e-03;AN=760;BaseQRankSum=-3.180e+00;DP=19412;ExcessHet=3.0103;FS=35.753;InbreedingCoeff=-0.0026;MLEAC=1;MLEAF=1.316e-03;MQ=60.00;MQRankSum=0.00;QD=1.10;ReadPosRankSum=-3.549e+00;SOR=4.800 GT:AD:DP:GQ:PL 0/1:82,10:92:99:127,0,4101
    X 3 . C T 184.20 . AC=1;AF=1.316e-03;AN=760;BaseQRankSum=-7.490e-01;DP=35150;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0013;MLEAC=1;MLEAF=1.316e-03;MQ=60.00;MQRankSum=0.00;QD=2.00;ReadPosRankSum=3.76;SOR=0.010 GT:AD:DP:GQ:PL 0/1:81,11:92:99:210,0,4898
    X 4 . A G 141.19 . AC=1;AF=1.312e-03;AN=762;BaseQRankSum=1.03;DP=19349;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0017;MLEAC=1;MLEAF=1.312e-03;MQ=60.00;MQRankSum=0.00;QD=1.50;ReadPosRankSum=3.48;SOR=0.017 GT:AD:DP:GQ:PL0/1:84,10:94:99:167,0,5101

    Best,
    Philip

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Kleinertp

    What makes you say that these variants are filtered due to allele imbalance?

  • KleinertpKleinertp Member

    That was just a guess.
    I don't know why they are filtered. I just know that they are :)

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
Sign In or Register to comment.