Filtering of heterozygotes only

henriettevdz South Africa Member

Hi!

I need help, please!
I'm working on lovebirds and trying to identify SNPs that can be included in a parentage verification panel. The reference genome comes from the offspring, and I have mapped its parents' reads to that reference to identify SNPs. I want to identify only those SNPs where the mother and father are both heterozygous, which would imply that all four grandparents also had a polymorphism at that site.

I did hard filtering using the following parameters.
First, as the Best Practices guidelines suggest:
QD<2 || FS>60 || MQ<40 || MQRankSum<-12.5 || ReadPosRankSum<-8.0
And then, to try to keep only the heterozygotes:
QD>2 || FS<10 || MQ>50 || MQRankSum>-5.1 || ReadPosRankSum<-8.0
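For readers following along, here is a minimal Python sketch of how the two expressions above behave. In these expressions `||` is a logical OR, so the whole expression is true as soon as one clause is true, and every annotation involved describes the site rather than any sample's genotype. The annotation values and site names below are made up for illustration, the helper functions are hypothetical and not part of GATK, and `ReadPosRankSum` is used as the standard spelling of the annotation name.

```python
# Hypothetical site-level annotation values, invented for illustration only.
sites = [
    {"name": "good_het", "QD": 18.0, "FS": 1.2, "MQ": 60.0,
     "MQRankSum": 0.1, "ReadPosRankSum": 0.3},
    {"name": "good_homvar", "QD": 30.0, "FS": 0.0, "MQ": 60.0,
     "MQRankSum": 0.0, "ReadPosRankSum": 0.0},
    {"name": "poor_site", "QD": 1.1, "FS": 75.0, "MQ": 32.0,
     "MQRankSum": -13.0, "ReadPosRankSum": -9.0},
]


def first_expression(s):
    """QD<2 || FS>60 || MQ<40 || MQRankSum<-12.5 || ReadPosRankSum<-8.0"""
    return (s["QD"] < 2 or s["FS"] > 60 or s["MQ"] < 40
            or s["MQRankSum"] < -12.5 or s["ReadPosRankSum"] < -8.0)


def second_expression(s):
    """QD>2 || FS<10 || MQ>50 || MQRankSum>-5.1 || ReadPosRankSum<-8.0"""
    return (s["QD"] > 2 or s["FS"] < 10 or s["MQ"] > 50
            or s["MQRankSum"] > -5.1 or s["ReadPosRankSum"] < -8.0)


# Names of the sites for which each expression evaluates to true.
first_matches = [s["name"] for s in sites if first_expression(s)]
second_matches = [s["name"] for s in sites if second_expression(s)]

print(first_matches)
print(second_matches)
```

In this toy data the first expression matches only the poor site, while the second matches all three, including the homozygous-variant one. None of the clauses looks at genotype, so no combination of these thresholds can separate heterozygotes from the rest.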

The mother is more heterozygous than the father: I get around 1.9 million raw SNPs for her vs 1.2 million for the father. After filtering, there are of course far fewer.

I then combined the genotypes of the two parents and repeated the process.

The results I get with both sets of filtering parameters, for the combined and the separate genotypes, are not bad, but I want only those SNPs where both the mother and father are heterozygous. I've checked the results in IGV and it seems that about 1 in every 10-20 SNPs that passed filtering meets this requirement. However, I cannot see any difference in the annotation values or quality that would let me filter these further. I went through them manually and selected those I wanted, but this subset had no consistent features that would separate it from the rest.

So my questions are:
1. Is there any way to select only those SNPs that are heterozygous in all individuals, other than going through them manually?
2. Some of the SNPs with the highest quality are heterozygous, but less than 20% of the reads carry the alternative allele. Can I select these, or should I go for lower quality but a higher percentage of the alternative allele (e.g. 50%)?

Thanks a lot!
Henriette
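As a side note on question 2, the fraction of reads carrying the alternative allele can be computed per sample rather than estimated by eye in IGV. The sketch below assumes a biallelic site and a GATK-style sample column whose FORMAT includes an `AD` key (allelic depths, reference first). The function name and the example values are hypothetical, and no particular threshold is implied.

```python
def alt_allele_fraction(format_keys, sample_field):
    """Return alt reads / (ref + alt reads) for one sample, or None if unavailable.

    format_keys is the VCF FORMAT column (e.g. "GT:AD:DP") and sample_field is
    the matching sample column (e.g. "0/1:24,6:30"). Assumes a biallelic site.
    """
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    ad = values.get("AD")
    if ad is None:
        return None
    parts = ad.split(",")
    if len(parts) < 2 or "." in parts:
        return None
    ref_depth, alt_depth = int(parts[0]), int(parts[1])
    total = ref_depth + alt_depth
    if total == 0:
        return None
    return alt_depth / total


# Made-up sample columns: one skewed heterozygote and one balanced heterozygote.
skewed = alt_allele_fraction("GT:AD:DP", "0/1:24,6:30")
balanced = alt_allele_fraction("GT:AD:DP", "0/1:15,15:30")
missing = alt_allele_fraction("GT:DP", "0/1:30")

print(skewed, balanced, missing)
```

Tabulating this value for both parents at each candidate site makes it easier to compare high-quality but skewed calls against balanced ones.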

Comments

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @henriettevdz
    Hi Henriette,

    I am confused. Why are you using different filters for heterozygotes after applying filters for all the sites?

    If you want to select the sites where both the mother and father are heterozygous, you can use -select 'vc.getGenotype("mother").isHet()' -select 'vc.getGenotype("father").isHet()'

    Have a look at this document for more information.

    -Sheila
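To illustrate what the selection described above amounts to (both parents heterozygous at a site), here is a rough Python sketch of the per-site check. It assumes a tab-delimited VCF with a `GT` key in FORMAT and diploid calls. The sample names `mother` and `father`, the scaffold names, and the records themselves are placeholders. This is not GATK code, only an illustration of the idea.

```python
def is_het(gt):
    """True if a diploid GT such as '0/1' or '1|0' has two different called alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def both_het_records(vcf_lines, sample_a, sample_b):
    """Yield VCF data lines where both named samples are heterozygous."""
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header")
        record = dict(zip(columns, line.split("\t")))
        gt_index = record["FORMAT"].split(":").index("GT")
        genotypes = [record[name].split(":")[gt_index]
                     for name in (sample_a, sample_b)]
        if all(is_het(gt) for gt in genotypes):
            yield line


# Tiny made-up VCF: only the first site is heterozygous in both samples.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmother\tfather",
    "scaffold1\t100\t.\tA\tG\t500\tPASS\t.\tGT:AD\t0/1:12,11\t0/1:10,9",
    "scaffold1\t250\t.\tC\tT\t400\tPASS\t.\tGT:AD\t0/1:14,13\t1/1:0,20",
    "scaffold1\t300\t.\tG\tA\t350\tPASS\t.\tGT:AD\t0/0:20,0\t0/1:9,12",
]

selected = list(both_het_records(example, "mother", "father"))
positions = [record.split("\t")[1] for record in selected]
print(positions)
```

The check reads only the genotype field of each sample, which is why quality annotations alone could not reproduce it.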

  • henriettevdz South Africa Member

    Hi Sheila!

    To tell you the truth, I'm very new to bioinformatics and have learnt the basics of programming by following the Best Practices guidelines. I followed the guidelines to apply the hard filters and thought that applying a different set of filters might give me only the heterozygotes, which it didn't. The document you have provided helps a lot, thanks!

    Because I'm a total novice I just want to ask about the command you have given me, please...

    Will this be correct, or should there be something in the () brackets?
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw_snps.vcf -select 'vc.getGenotype("mother").isHet()' -select 'vc.getGenotype("father").isHet()'

    Is it correct to use the combined genotype file?

    The ("mother") and ("father") fields - should I include something else here? Is it supposed to be e.g. father.vcf or is it just "father"?

    Thanks so much for your help!

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @henriettevdz
    Hi Henriette,

    Ah okay. Thanks for the clarification. Your command above looks good. There is no need to have anything in the ().

    Yes, you should use the combined final VCF. The "mother" and "father" should be the actual mother and father sample names in the VCF.

    -Sheila
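Since the names passed to getGenotype must match the sample names in the VCF, it can help to list those names first. In a VCF, the `#CHROM` header line has nine fixed columns (through FORMAT) followed by one column per sample. The sketch below reads the names from that line. The function name and the sample names in the example header are hypothetical.

```python
def sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM..FORMAT; everything after that is a sample.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


# Made-up header; with a real file, pass an open file handle instead of a list.
header_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tLB_female_01\tLB_male_01",
]

names = sample_names(header_lines)
print(names)
```

Whatever names this prints for the combined VCF are the strings to put inside the quotes in place of mother and father.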

  • henriettevdz South Africa Member

    Thanks a lot!
