Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Hard filters for Haplotype Caller

Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

Hi team, (this is really two questions)

  1. Do you have any recommendations for hard-filtering haplotypecaller-generated vcfs ?

This was my previous filter for the unifiedgenotyper output"

`GenomeAnalysisTK -R ${ref} \
    -T VariantFiltration \
        -V ${my_vcf} \
            -filter "QUAL<1000.0" -filterName "LowQual" \
                -filter "MQ0>=4&&((MQ0/(1.0*DP))>0.1)" -filterName "BadVal" \
                    -filter "MQ<60" -filterName "LowMQ" \
                        -filter "QD<5.0" -filterName "LowQD" \
                            -filter "FS>60" -filterName "FishStra" \
                                  -filter "DP<2000" -filterName "lowTotDP" \
                                        -o qual_marked.vcf`

Obviously fields such as MQ0 won't work as this isn't present in the HC-generated vcf, and obviously there are many fields to filter on. (There are 222 samples and 1.9m variants in the vcf)

  1. One filter that I'm really keen to apply but never got the hang-of, is to drop all individual genotype calls where the coverage is less than 10X. (This is because I'm really interested in getting the genotype correct, rather than actually detecting mutations).

Sincerely,

William Gilks

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Will_Gilks
    Hi William,

    We do have some basic hard filtering recommendations here: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
    However, you may have to tweak some of the thresholds to better suit your data.

    Also, have a look at our g=Genotype Refinement workflow which will help you get better genotype calls. http://gatkforums.broadinstitute.org/discussion/4723/genotype-refinement-workflow

    -Sheila

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    Thanks @Sheila clearly a matter of me looking in the correct places. GATK guides seem to be becoming rather scattered.

    How about my question 2.. How to set individual genotype calls which have less than 10X coverage to ./. ?

  • Will_GilksWill_Gilks University of Sussex, UKMember ✭✭

    @Sheila Ok, thanks for the link, I must've missed before. Example code below.

    GenomeAnalysisTK -R ${ref_genome} \ -T VariantFiltration \ -V raw.vcf \ -filter "QUAL<1000.0" -filterName "lowQual" \ -filter "MQ<60" -filterName "lowMQ" \ -filter "QD<5.0" -filterName "lowQD" \ -filter "FS>30" -filterName "hiFS" \ -G_filter "DP<10" -G_filterName "lowiDP" \ -o marked.vcf

    GenomeAnalysisTK -R ${ref_genome} \ -T SelectVariants \ -V marked.vcf \ --excludeFiltered \ --setFilteredGtToNocall \ -o filtered.vcf

  • SumuduSumudu Sri LankaMember

    Hi,

    I have generated gvcfs for 8 clinical samples where 2 of them belong to a different phenotype of disease. First I combined those 2 gvcfs together and the other 6 in to a different vcf file. I applied hard filters to the 2 vcf files separately.

    Then I tried combining all the 8 gvcfs in to a single vcf file and applied hard filters.

    What I have seen was that in the latter case non of the variants have filtered out. But in the first case variants have been filtered.

    I have chosen hard filtering thresholds after analyzing graphs for my raw variants. Could it be that INFO field annotations are calculated considering the samples in my vcf? In that case is it correct to follow the first method where I generated 2 vcfs for my phenotypically different samples?

    Thanks
    -Sumu

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Sumudu
    Hi Sumu,

    You should analyze all 8 samples together (run GenotypeGVCFs on all 8 sample GVCFs together). Can you post an example of where a site is filtered in one case but not the other?

    Thanks,
    Sheila

  • SumuduSumudu Sri LankaMember

    @Sheila :Thank you.

    I have realized that my above post indicating that non of the variants have filtered is by mistake. Actually I realized that records are not removed but FILTER/PASS is added. So some have actually filtered. But still I have a confusion appreciate if you could help me out.

    Here's the code and a SNP present when I analyze all 8 samples together but absent otherwise. ( When I run GenotyGVCFs on 6 samples together and GenotypeGVCFs on other 2 samples separately considering phenotypic differences)

    java -Xmx15g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R Ldonovani_BPK282A1.genome.fasta -stand_call_conf 30 -V L2331_S1.g.vcf -V L2339_S2.g.vcf -V L2343_S3.g.vcf -V L2372_S4.g.vcf -V R5_S5.g.vcf -V R7_S6.g.vcf -V VL28_S7.g.vcf -V VL38_S8.g.vcf -o ~/Documents/New_VA_1/Raw_variants.vcf
    ....................................................................................
    Ld01_v01s1 319 . G A 385.91 PASS AC=3;AF=0.188;AN=16;BaseQRankSum=0.470;ClippingRankSum=0.00;DP=182;ExcessHet=3.9794;FS=13.361;MLEAC=3;MLEAF=0.188;MQ=71.06;MQRankSum=0.00;QD=5.76;ReadPosRankSum=-5.300e-01;SOR=2.444 GT:AD:DP:GQ:PGT:PID:PL 0/1:37,6:43:99:0|1:319_G_A:141,0,1874 0/0:38,0:38:73:.:.:0,73,839 0/0:28,0:28:84:.:.:0,84,687 0/0:17,0:17:50:.:.:0,50,442 0/0:22,1:23:24:0|1:305_A_G:0,24,945 0/1:10,6:16:99:0|1:305_A_G:222,0,402 0/1:6,2:8:66:0|1:319_G_A:66,0,246 0/0:6,0:6:18:0|1:319_G_A:0,18,270

    ....................................................................................

    Here's a SNP I found in my 8 sample vcf and also as the 1st SNP in my 6 sample vcf.

    Ld01_v01s1 423 . A G 304.45 PASS AC=1;AF=0.063;AN=16;BaseQRankSum=0.175;ClippingRankSum=0.00;DP=196;ExcessHet=3.7566;FS=3.796;MLEAC=1;MLEAF=0.063;MQ=58.22;MQRankSum=-3.756e+00;QD=13.24;ReadPosRankSum=-1.710e+00;SOR=0.588 GT:AD:DP:GQ:PGT:PID:PL 0/0:38,0:38:0:.:.:0,0,611 0/0:35,0:35:62:.:.:0,62,819 0/0:20,0:20:57:.:.:0,57,855 0/0:19,0:19:54:.:.:0,54,810 0/0:36,0:36:0:.:.:0,0,611 0/1:14,9:23:99:0|1:396_G_A:336,0,561 0/0:7,0:7:21:.:.:0,21,154 0/0:18,0:18:48:.:.:0,48,720

    Ld01_v01s1 423 . A G 52.50 PASS AC=1;AF=0.083;AN=12;BaseQRankSum=-7.590e-01;ClippingRankSum=0.00;DP=136;ExcessHet=3.3579;FS=0.000;MLEAC=1;MLEAF=0.083;MQ=90.25;MQRankSum=-3.888e+00;QD=3.09;ReadPosRankSum=-2.546e+00;SOR=0.906 GT:AD:DP:GQ:PGT:PID:PL 0/0:22,0:22:66:0|1:396_G_A:0,66,1637 0/0:35,0:35:62:.:.:0,62,819 0/0:20,0:20:57:.:.:0,57,855 0/0:19,0:19:54:.:.:0,54,810 0/0:23,0:23:0:.:.:0,0,303 0/1:14,3:17:84:0|1:396_G_A:84,0,579

    By 8 samples together I got ~2M SNPs but by 6 samples together I got ~200,000 SNPs and in the other 2 samples together I got ~420,000 SNPs. Which is a huge difference. I generated stats using 'bcftools stats' command. My parasite organism is ~32MB in size.
    Appreciate your advice on this analysis.

    Thanks
    Sumu.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Sumudu
    Hi Sumu,

    Can you please post the GVCF records from the variant samples at those sites?

    It is definitely possible to recover low GQ sites in the GVCF if they are variant in other samples. In your first case, the variant samples have low GQs, but when GenotypeGVCFs sees other samples have a variant at the site, there is a better chance the site will actually be called as variant. That is why it is best to analyze as many samples together as possible.

    -Sheila

  • SumuduSumudu Sri LankaMember

    @Sheila :Thank you.

    Below are the gvcf records for the above mentioned positions 319 & 423 in all my samples.

    sample1

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE

    Ld01_v01s1 319 . G A, 0 . DP=35;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=126000.00 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:34,0,0:34:99:0|1:319_G_A:0,102,2357,102,2357,2357:17,17,0,0
    Ld01_v01s1 423 . A G, 0 . DP=22;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=79200.00 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:22,0,0:22:66:0|1:396_G_A:0,66,1637,66,1637,1637:11,11,0,0

    sample 2
    Ld01_v01s1 319 . G . . END=319 GT:DP:GQ:MIN_DP:PL 0/0:34:61:34:0,61,740
    Ld01_v01s1 423 . A . . END=423 GT:DP:GQ:MIN_DP:PL 0/0:35:62:35:0,62,819

    sample3
    pos 319 & 423 doesn't exist

    sample4
    pos 319 & 423 doesn't exist

    sample5
    Ld01_v01s1 319 . G . . END=319 GT:DP:GQ:MIN_DP:PL 0/0:31:7:31:0,7,631
    Ld01_v01s1 423 . A . . END=423 GT:DP:GQ:MIN_DP:PL 0/0:23:0:23:0,0,303

    sample6
    Ld01_v01s1 319 . G . . END=319 GT:DP:GQ:MIN_DP:PL 0/0:20:0:20:0,0,56
    Ld01_v01s1 423 . A G, 55.77 . BaseQRankSum=-0.759;ClippingRankSum=0.000;DP=17;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-3.888;RAW_MQ=59273.00;ReadPosRankSum=-2.546 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:14,3,0:17:84:0|1:396_G_A:84,0,579,126,588,714:9,5,2,1

    sample7
    Ld01_v01s1 319 . G . . END=319 GT:DP:GQ:MIN_DP:PL 0/0:10:0:10:0,0,138
    pos 423 doesn't exist

    sample8
    Ld01_v01s1 319 . G A, 108.18 . DP=3;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=8356.00 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,3,0:3:9:0|1:319_G_A:135,9,0,135,9,135:0,0,0,3
    pos 423 doesn't exist

    ""That is why it is best to analyze as many samples together as possible."" According to your comment above, even if we know the samples have phenotypic differences, e.g. drug resistance or different disease tropisms? That big difference in the # of SNPs resulted after just adding 2 more samples(which were found to have a different phenotype) to the GenotypeGVCFs.

    Even after hard filtering I still have more than ~1.5M variants to analyze.

    Thank you
    Sumu

  • sysulssysuls ChinaMember

    Hi Sheila,

    I want to get reliable snps from a single plant individual resequencing data. And I get so confused that what hard filtering threshold should I choose because the density distribution of FS, QD, ReadPosRankSum can be different from the distribution you plot in this page(https://software.broadinstitute.org/gatk/guide/article?id=6925). And 71.5% of FS value of the snps are below 0.001 ( which is 0.000 in the vcf file).

    So do you have any recommendations for hard-filtering threshold?
    (PS: The density distribution of FS, QD and ReadPosRankSum are plotted in the pdf files.)

    Thanks.

    Saib

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @Saib What is the plant you work with? Is it diploid, and if not, what is its ploidy? Is it assumed to have wild population genetics or is it an agriculturally modified varietal? These may all influence the annotation profiles.
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Sumudu
    Hi Sumu,

    I think this article will answer your questions.

    Notice the GQs are relatively low for most of the samples. When you add in the evidence from other samples, the confidence that the site is variant increases.

    -Sheila

  • sysulssysuls ChinaMember

    @Geraldine_VdAuwera
    Thanks.
    This data is from a wild individual of Sonneratia alba, which is a diploid plant.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited January 2017

    @sysuls
    Hi Saib,

    There certainly are bound to be differences in plots of annotation values between species as well as within species. The annotation values depend on your data. That said, your plots look fine to me. The FS values should be low (low values mean less chance of strand bias). The best thing to do is play around with the filter thresholds you set, and see which ones work best for you. Have a look at this tutorial and this tutorial on filtering as well.

    -Sheila

    P.S. You may also have a look at what exactly each annotation value means. You can find the descriptions in the tool documentation under Annotation Modules. That will help you better understand how to set thresholds too. Good luck!

Sign In or Register to comment.