Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Force skip INDELs in GATK output

Dear GATK, I ma using GATK4, my syntax is:

gatk HaplotypeCaller -R $reference -I in.bam -ERC BP_RESOLUTION --max-alternate-alleles 1 -O out.raw.snps.g.vcf -L $SNP_TARGET

In some regions i have curious output (also compare with another variant callers + IGV).

Result from GATK in a few samples is:

SNP target is:

chr1 152145740 152145741 rs12144907

Output is:

chr1 152145741 . G GGAGAGC,<NON_REF> 3985.64 . BaseQRankSum=-1.051;DP=317;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.789;RAW_MQandDP=1131778,317;ReadPosRankSum=1.158 GT:AD:DP:GQ:PL:SB0/1:138,172,0:310:99:3993,0,3067,4406,3603,8009:76,62,98,74

But in IGV I can see genotype GC (total count is 406, C is 208 and G is 197). I can see a few insert bases (called by GATK output GAGAGC), but only maybe in 3 reads (note: I do not downsampling BAM in IGV).

Result from bcftools is:

chr1 152145741 . G C,A,<*> 0 . DP=381;I16=102,28,110,23,5898,308852,6254,334182,7562,446018,7880,470048,2518,56072,2784,64140;QS=0.48217,0.514629,0.00320142,0;VDB=2.97088e-34;SGB=-0.693147;RPB=0.0605761;MQB=0.804333;MQSB=0.764982;BQB=0.578277;MQ0F=0 PL 255,0,255,255,255,255,255,255,255,255

Which seems to be correct to my observation in IGV. Note - I need parameter -ERC BP_RESOLUTION.

Any idea how to set up GATK to avoid this weird SNP calling in this region (and few others)? Maybe force skip insertion?

Thank you for any help and reply!

Best,

Paul.

Answers

  • bshifawbshifaw Member, Broadie, Moderator admin

    Are you looking for the following type of argument?

    --pcr-indel-model
    The PCR indel model to use
    When calculating the likelihood of variants, we can try to correct for PCR errors that cause indel artifacts. The correction is based on the reference context, and acts specifically around repetitive sequences that tend to cause PCR errors). The variant likelihoods are penalized in increasing scale as the context around a putative indel is more repetitive (e.g. long homopolymer). The correction can be disabling by specifying '-pcrModel NONE'; in that case the default base insertion/deletion qualities will be used (or taken from the read if generated through the BaseRecalibrator). VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE.

    The --pcr-indel-model argument is an enumerated type (PCRErrorModel), which can have one of the following values:

    NONE
    no specialized PCR error model will be applied; if base insertion/deletion qualities are present they will be used
    HOSTILE
    a most aggressive model will be applied that sacrifices true positives in order to remove more false positives
    AGGRESSIVE
    a more aggressive model will be applied that sacrifices true positives in order to remove more false positives
    CONSERVATIVE
    a less aggressive model will be applied that tries to maintain a high true positive rate at the expense of allowing more false positives

  • zgtmanzgtman Member

    Hello @bshifaw , thank you for reply. I tested to add parameter --pcr-indel-model to my workflow, but result is the same (mentioned above). Any other idea?

  • bshifawbshifaw Member, Broadie, Moderator admin

    Are you looking at the bamout in IGV produced by HC, as mentioned here? Also mind providing a screenshot.

    There are a few more knobs you can turn to decrease sensitivity, they are listed in the HC tool documentation along with their description (e.g. --standard-min-confidence-threshold-for-calling). Search for keywords like sensitivity and threshold and most of them should be highlighted.

Sign In or Register to comment.