We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

GATK v4.1.2.0 AD 0,0 but DP >0 for some loci (pooled sequencing, ploidy =20 , non-model species)

cmtcmt Seattle WAMember

Hello!

I have used HaplotypeCaller and Joint Genotyping on 12 groups of pooled whole genome sequences, using a ploidy of 20. I based my pipeline on the Best Practices, but it was modified based on my study's experimental design and the limitations of the reference genome available for my species.

There were only minor issues (I think) and I eventually I filtered the SNPs that discovered for quality parameters with all 12 pools together:

gatk --java-options "-Xmx${command_mem}m" VariantFiltration \
     -R ${ref_fasta} \
     -V ${raw_SNPSvcf} \
     --filter-expression "QD < 2.0" \
     --filter-name "QDlessthan2" \
     --filter-expression "FS > 60.0" \
     --filter-name "FSgreaterthan60" \
     --filter-expression "MQ < 55.0" \
     --filter-name "MQlessthan55" \
     --filter-expression "MQRankSum < -12.5" \
     --filter-name "MQRankSumlessthanneg12.5" \
     --filter-expression "ReadPosRankSum < -8.0" \
     --filter-name "ReadPosRankSumlessthanneg8" \
     --filter-expression "SOR > 3.0" \
     --filter-name "SORgreaterthan3" \
     --filter-expression "DP > 6214.0" \
     --filter-name "DPgreaterthan6214" \
     --genotype-filter-expression "DP < 2.0" \
     --genotype-filter-name "DPlessthan2" \
     --disable-read-filter NotDuplicateReadFilter \
     -O "${output_name}.${vcf_suffix}"

and then I filtered the SNPs per-pool for population-based filters, like max and min read depth:

gatk --java-options "-Xmx${command_mem}m" VariantFiltration \
     -R ${ref_fasta} \
     -V ${pool_vcf} \
     --disable-read-filter NotDuplicateReadFilter \
     --genotype-filter-expression "DP < 20" \
     --genotype-filter-name "DPlessthan20" \
     --genotype-filter-expression "${maxDP}" \
     --genotype-filter-name "DPgrtrthan2%" \
     --set-filtered-genotype-to-no-call true \
     -O "${pool_name}_poolFiltered.vcf.gz"

I combined the pools back together and used SelectVariants to output a final vcf of the filtered SNPs that passed my filters, and had no missing data (no no-calls, and no non-variant sites)

        gatk --java-options "-Xmx${command_mem}m" SelectVariants \
          -R ${ref_fasta} \
          -V ${vcf} \
          --disable-read-filter NotDuplicateReadFilter \
          --set-filtered-gt-to-nocall true \
          --exclude-non-variants true \
          --exclude-filtered true \
          --remove-unused-alternates true \
          --restrict-alleles-to BIALLELIC \
          --max-nocall-number 0\
          -O "${output_Select_filename}.vcf.gz"

Looking through some of the output loci, I have noticed a pattern in 84 SNPs where there is 0,0 in the AD field, but a value greater than 0 in the FORMAT DP field, and genotype calls (I used a ploidy of 20 so there are 20 "genotypes").
Here are two examples from my filtered VCF file, the first is the most extreme:

LG21    19577273    .   C   T   120.20  PASS    AC=105;AF=0.438;AN=240;BaseQRankSum=-1.133e+00;DP=723;FS=3.853;MQ=60.00;MQRankSum=0.00;QD=2.00;ReadPosRankSum=0.904;SOR=0.261;set=Intersection  GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1/1/1:0,0:79:0:45,26,19,15,12,9,7,6,4,3,2,2,1,0,0,0,0,1,2,4,13    0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1:0,0:76:0:59,27,18,12,8,5,3,2,1,0,0,0,0,1,2,3,5,8,12,20,49   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:0,0:36:0:1,0,0,0,0,0,0,0,0,1,1,2,3,3,5,6,8,10,13,19,82  0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1:0,0:34:0:42,17,10,6,4,2,1,0,0,0,0,1,2,3,5,7,10,14,20,30,69  0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:56,4:60:2:109,0,2,9,19,31,45,61,79,99,121,145,173,205,242,287,341,411,511,680,2400  0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:0,0:70:0:72,33,23,16,12,9,6,4,3,2,1,0,0,0,1,1,3,5,9,16,45   0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:0,0:32:0:46,24,16,12,9,6,5,3,2,1,1,0,0,0,0,1,2,3,6,11,30    0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:0,0:56:0:59,29,20,15,12,9,7,5,3,2,1,0,0,0,0,0,1,2,4,9,28    0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1:0,0:72:1:28,11,6,4,2,1,1,0,1,1,2,3,4,6,8,11,15,20,28,41,90  0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1:0,0:53:0:126,28,16,10,6,3,2,1,0,0,1,1,3,5,7,11,16,22,31,49,239  0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,0:83:0:288,73,53,41,32,26,21,16,13,10,7,5,3,2,1,0,0,0,2,6,68  0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:0,0:72:0:68,32,22,16,12,9,6,4,3,2,1,0,0,0,0,0,1,3,5,10,29

LG20    18511514    .   T   A   465.75  PASS    AC=44;AF=0.183;AN=240;BaseQRankSum=-3.990e-01;DP=685;FS=4.643;MQ=59.62;MQRankSum=0.00;QD=3.11;SOR=0.488;set=Intersection    GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:64,0:64:0:0,0,0,0,0,8,24,41,61,82,106,132,162,196,236,283,340,415,519,696,2141  0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:64,5:69:0:146,0,0,7,17,30,45,63,82,103,127,154,185,220,261,309,369,446,556,743,2677 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:38,3:41:0:88,0,0,4,10,18,27,37,49,61,76,92,110,131,155,184,220,267,333,445,1613 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:48,0:48:11:0,11,23,35,48,62,77,94,111,130,151,173,199,228,261,301,349,412,500,650,2247  0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:88,0:88:0:0,0,7,25,46,68,92,119,147,179,214,252,295,344,401,469,552,658,807,1056,3047   1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,0:64:3:525,200,158,132,113,98,86,75,65,57,50,43,37,31,26,21,16,11,7,3,0   0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:54,0:54:9:0,9,18,28,39,50,62,75,89,104,120,139,159,182,209,241,280,330,400,520,1800 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,0:42:2:286,97,75,63,53,46,40,35,31,27,23,20,17,15,12,10,8,6,4,2,0 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:66,0:66:9:0,9,18,28,39,50,62,75,89,104,120,139,159,182,209,241,280,330,400,520,1800 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:64,0:64:0:0,0,0,0,8,21,36,53,71,92,114,140,169,202,240,285,340,412,512,679,1965 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1:36,4:40:2:121,4,0,2,6,12,20,28,38,50,63,77,94,113,135,162,195,238,298,402,1491  0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:45,0:45:6:0,6,11,18,24,31,39,47,55,65,75,87,99,114,131,151,175,206,250,325,1125

I can't reconcile this pattern. I know that the sum of the AD field does not necessarily have to equal the FORMAT DP because the AD values are unfiltered but do not include uninformative reads.

Is this an artifact of my filtering? It is a symptom of a larger problem? My inclination is to throw the 84 loci that fit this pattern out, but I am worried I will be missing a red flag in my data.

Thank you!

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited November 2019

    Hi @cmt

    1. Can you please provide the HaplotypeCaller command you use?
    2. Mutect2 is better tool to use for pooled samples. We do not have best practices for this yet, but we think Mutect2 might give better results in this use case i.e. better estimates of allele fraction (and thus probable genotypes) out of the box.. Can you please try it and see if that solves the issue? HaplotypeCaller is known to give weird results in pooled samples use case.
  • cmtcmt Seattle WAMember

    Hi @bhanuGandham !

    This is the command that used to generate those data:

     "/gatk/gatk" --java-options "-Xmx${command_mem_gb}G ${java_opt}" \
          HaplotypeCaller \
          -R ${ref_fasta} \
          -I ${input_bam} \
          -L ${interval_list} \
          --disable-read-filter NotDuplicateReadFilter \
          --pcr-indel-model NONE \
          --sample-ploidy 20\
          --max-alternate-alleles 3 \
          --min-pruning 3 \
          --max-genotype-count 1771 \
          --read-filter OverclippedReadFilter \
          --verbosity ERROR \
          -ERC GVCF \
          -O ${output_filename}
    

    I have been running GATK on GCP, but we can't afford that method anymore, and we are also running out of time. I'm working on getting access to a cluster, but that might be in a month or so.

    I know this is not an easy question to answer, but how weird are the weird results that HaplotypeCaller gives with pooled samples, in your experience? I've had strange issues from the start with GATK, but they have mostly turned out to be bugs that were later fixed. I'm using GATK for marker discovery and most of the papers I'm reading use HaplotypeCaller for their methods. Could you point me in the direction of what I might be risking using HaplotypeCaller instead of Mutect2?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @cmt

    We recommend that you use Mutect2. However, if you must use the HaplotypeCaller , it should not be a problem with the callset, but make sure to exclude AD from any filtering steps, because there is a bug in calculating the AD. the next GATK version release will have a fix for it.
    In the future use Mutect2 as it is better for pooled samples.

Sign In or Register to comment.