Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

How do I get the unique SNPs for each strain?

antonioggsousaantonioggsousa Oeiras, PortugalMember
Hi,

First, I would like to say that I'm new to genome and variant calling data analyses and, of course, to GATK.
I read some tutorials and the BestPractices guide before I start. I'm using gatk version 4.1.4.0.

So, let's go to the point. I have a couple of genomes for different yeast strains. I'm interest into SNPs that are unique to each of these strains. So, I started by aligning these against the reference genome available with bwa-mem, convert the SAM to BAM, and sorted the BAM files, for each of the genomes that I have.
Then, I used PICARD to fix the read group header and finally I used GATK.
Regarding GATK, the commands that I used were the following:

# call variants for each genome alignment file with 'HaplotypeCaller' (´$´ stands for bash variables; "${bam}" are just the prefix of sample names over which I loop):
$GATK HaplotypeCaller -R $DB -I ${IN}/${bam}.sort.rg.bam -O ${OUT}/${bam}.g.vcf -ERC GVCF

# then I realize that in the new version of GATK you need to combine first your multiple '.vcf' files before calling GenotypeGVCFs. In this command I used the 'g.vcf' files generated before to combine the variants of the couple of genomes that I have :
$GATK CombineGVCFs -R $DB --variant ${OUT}/${1}.g.vcf --variant ${OUT}/${2}.g.vcf --variant ${OUT}/${3}.g.vcf --variant ${OUT}/${4}.g.vcf -O ${OUT}/multiple_yeast_strain.g.vcf

# finally I call the variants for the combined gvcf file with 'GenotypeGVCFs' command:
$GATK GenotypeGVCFs -R $DB -V ${OUT}/multiple_yeast_strain.g.vcf -O ${OUT}/genotype_merged.vcf

So far, so good (I think!). Then to get the unique variants, particularly the SNPs, for each strain, I mean the SNPs that appeared in strain A, but does not appear in the remaining strains, the SNPs that appear in strain B, but does not appear in strain A neither in any other strain, and so. For this I thought to use the 'SelectVariants' command in GATK with the option '--discordance', like this (for one of the strains):

# I call the variant obtained with 'HaplotypeCaller' for the strain that I'm interested in, lets say strainA.g.vcf, with the '-V' option and I want the SNPs that appear in strainA.g.vcf but not in the joint GenotypeGVCFs command, so I give the file generated previously to '--discordance' and I select only the type of variants that I want: "-select-type SNP"
$GATK SelectVariants -R $DB --discordance ${OUT}/genotype_merged.vcf -V ${OUT}/strainA.g.vcf -select-type SNP -O strainA.unique.g.vcf

# then, when I run the following command to get the no. of SNPs unique to strainA, I got "0 SNPs"
$GATK CountVariants -V strainA.unique.g.vcf

When I wrote this I realize that actually it may not make too much sense to get the unique variants for strainA from a joint gvcf file where the strainA it's there. So, I tried to use the "--concordance" option, just to look if worked and I still got "0 SNPs". So, I'm struggling with what I'm doing wrong and a possible solution to get the unique SNPs for each strain.
I hope that you can help me on this.
Thanks in advance,
António

Answers

Sign In or Register to comment.