Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

inadvertent removal of monomorphic/invariant sites

nketchinketchi Member
edited May 16 in Ask the GATK team
Hi GATK team,

I've been having some problems keeping monomorphic (invariant) sites in my vcf file. I need high quality monomorphic sites for generating an accurate site frequency spectrum. I was wondering if someone could offer some insight as to what stage of my pipeline the monomorphic sites are being lost.

using GATK/4.0.5.1 for haplotype calling and GATK/3.8-0 for the rest

I run haplotype caller as follows:
~/bin/gatk --java-options "-Xmx10g" HaplotypeCaller -R $reference -I ${insampleID}.bam -O ${outsampleID}_g.vcf.gz -ERC GVCF

then consolidate and genotype:

~/bin/gatk --java-options "-Xmx10g -Xms4g" CombineGVCFs -R $reference $cmd -O $output_vcfs/pop_cohort.g.vcf

~/bin/gatk --java-options "-Xmx10g -Xms4g" GenotypeGVCFs -R $reference --variant --includeNonVariantSites $output_vcfs/pop_cohort.g.vcf -O $output_vcfs/pop_genotyped_cohort.g.vcf

Then various filters:

## Select only snps with the "snp_filter"
~/bin/gatk --java-options "-Xmx10g" SelectVariants -R $reference -V $raw_vcf --select-type-to-include SNP -O $SNP_filtered

~/bin/gatk --java-options "-Xmx10g" VariantFiltration -R $reference -V $SNP_filtered -O $gatk_filter_flag --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5" --filter-name "snp_filter"

vcftools --vcf $gatk_filter_flag --recode --remove-filtered-all --out $gatk_filtered

## Retain only biallelic SNPs with a min depth of 5 and max depth of 200
vcftools --vcf $gatk_filtered.recode.vcf --min-alleles 2 --max-alleles 2 --minDP 5 --maxDP 200 --recode --remove-filtered-all --out $allele_filtered

NB by looking at the size of the vcf files and the number of variants, it looks as though this step is the one which drops the sites. Is it something to do with the minDP flag for 0/0 and ./. calls?

## Split vcf file by population and filter by max missing 50
vcftools --vcf $allele_filtered.recode.vcf --keep $pop --recode --remove-filtered-all --out $allele_filtered.pop
vcftools --max-missing 0.5 --vcf $allele_filtered.pop.recode.vcf --recode --remove-filtered-all --out $maxmiss_filtered.pop

Any advice greatly appreciated!

thanks

Answers

  • AdelaideRAdelaideR Member admin

    This is quite a complex issue, it would take a while to figure out what is happening with sites that are ./. without diving into the actual files. But, I have a hunch that the filters are removing these sites because the DP may be 0. So, one way to retain them might be to set the --minDP value to 0.

    There is a slightly older discussion on the forum about this topic. Perhaps some of the observations found there could be useful.

  • nketchinketchi Member
    Hi AdelaideR,

    Thanks for getting back to me.

    unfortunately, we have run into another problem.

    My first question is about the --includeNonVariantSites option. I can't see an update about whether or not this was eventually incorporated into GATK4. I can see there are many posts about it, and requests on github however. Could you update on this?

    for now the workaround has been to do the following:

    in GATK3:

    java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R $reference --variant $variant --includeNonVariantSites -o variants_nonvar.vcf

    then switch to GATK4:

    gatk SelectVariants -R $reference -V variants_nonvar.vcf --select-type-to-include NO_VARIATION --output test.vcf

    but this keeps the indels! So we try to explicitly remove them:

    gatk SelectVariants -R $reference -V variants_nonvar.vcf --select-type-to-include NO_VARIATION --select-type-to-exclude INDEL --output test.vcf

    but this still includes the indels.

    We've tried this both in GATK3 and GATK4 to no avail. Not a huge problem as we can use some awk code to remove them, but I thought we should flag this up

    Do you have any suggestions?

    Thanks again for your help,

    Josie
  • AdelaideRAdelaideR Member admin

    @nketchi

    Hi Josie - Thank you for bringing this issue to our attention. I will work with @bhanuGandham and the GATK development team to submit a ticket on this issue or get more answers on how best to remove the indels.

  • nketchinketchi Member
    edited May 30
    Thanks AdelaideR! For now, we will continue to remove them manually, but happy to hear an update when you have one!
  • bshifawbshifaw Member, Broadie, Moderator admin

    @nketchi ,
    Looks like --include-non-variant-sites has been added to gatk4 GenotypeGVCFs according to this git ticket 5219.

Sign In or Register to comment.