How to compare two vcf files of two cultivars?

Hi,
I am looking for a program to compare the VCF files of two cultivars. I have three VCF files (vcf1, vcf2, vcf3) containing SNPs from three different cultivars. The SNPs were called with GATK's UnifiedGenotyper. After stringent filtering with GATK's hard filters, I found approximately 6 million SNPs across the three cultivars. Now I would like to compare the SNP files pairwise (vcf1/vcf2, vcf1/vcf3 and vcf2/vcf3) to identify common and cultivar-specific SNPs, and to generate Venn diagrams for each combination.
Can anyone suggest a tool for identifying the SNPs that are common to two cultivars and those that are specific to each? I appreciate your help.
Answers
@shis
Hi,
Have a look at this page which should help.
-Sheila
@Sheila
Thank you so much for the link. I combined the two VCF files and then extracted the intersection from the combined file using the following commands:
Combine the data:
java -Xmx2g -jar $GATK -T CombineVariants -R IRGSP.fasta -V cultivar1.vcf -V cultivar2.vcf -o union.vcf
Select the intersection:
java -Xmx2g -jar $GATK -T SelectVariants -R IRGSP.fasta -V union.vcf -select 'set == "Intersection";' -o intersect.vcf
Now I would like to extract the SNPs that fall outside the intersection. How can I get the SNPs that differ between the two cultivars?
@shis
Hi,
Right now, you have all the sites where the two cultivars are variant. You can simply invert the selection criteria in SelectVariants using --invertSelect. That will choose the sites that are not in set == "Intersection".
-Sheila
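As a way to sanity-check the result of the inverted selection, here is a minimal Python sketch (not part of GATK) that reads the `set` annotation CombineVariants writes into the INFO column, tallies records per value, and applies the same "not Intersection" filter. The helper names and the example records are hypothetical, and the set names `variant` and `variant2` are an assumption about how untagged inputs get labelled; the names in your file depend on how the inputs were given to CombineVariants.

```python
from collections import Counter


def set_value(vcf_line):
    """Return the value of the `set` INFO key for one VCF record, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:  # not a complete record line
        return None
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        if key == "set":
            return value
    return None


def tally_sets(lines):
    """Count records per `set` value, skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[set_value(line)] += 1
    return counts


def invert_intersection(lines):
    """Keep header lines and records whose `set` value is not 'Intersection'."""
    return [
        line for line in lines
        if line.startswith("#") or set_value(line) != "Intersection"
    ]


# Made-up records for illustration only; set names are assumed.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr01\t100\t.\tA\tG\t50\tPASS\tAC=2;set=Intersection\n",
    "chr01\t200\t.\tC\tT\t50\tPASS\tAC=2;set=variant\n",
    "chr01\t300\t.\tG\tA\t50\tPASS\tAC=2;set=variant2\n",
]

counts = tally_sets(example)
kept = invert_intersection(example)
print(dict(counts))
print(len(kept), "lines kept (headers included)")
```

For a real file, the same functions can be fed an open file handle, e.g. `tally_sets(open("union.vcf"))`. The per-set counts are also the numbers a two-circle Venn diagram needs.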
Hi Sheila,
Following your suggestion, I used the following command to get the variant sites in the two cultivars that are not in 'set == "Intersection"':
$ java -Xmx16g -jar $GATK -T SelectVariants -R ./ref_genome/IRGSP.fasta -V cultivar1_cultivar2_union_final.vcf -select 'set == "Intersection";' -o cultivar1_cultivar2_variant_final.vcf -invertSelect
Now my question is: are these SNPs differences between the two cultivars defined with respect to the reference genome, or independently of the reference genome?
Thanks for the help.
/Shis
@shis
Hi,
You now have the sites that are either variant in one set or variant in the other. Those sites are always called with respect to the reference, but the two cultivars will have different genotypes from each other at them. I hope this helps.
-Sheila
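To make the categories concrete, here is a minimal Python sketch that compares two call sets directly. It is an illustration, not a GATK feature: the function names and the example records are hypothetical, and it assumes each site can be keyed on (CHROM, POS, REF, ALT), all of which are defined relative to the reference. It compares sites and alleles only and does not look at the sample genotype columns, so a position with different ALT alleles in the two files counts as two separate sites.

```python
def site_keys(lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF record lines."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_sites(lines_a, lines_b):
    """Split sites into shared, only in A, and only in B."""
    a = site_keys(lines_a)
    b = site_keys(lines_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up records for illustration only.
cultivar1 = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr01\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr01\t200\t.\tC\tT\t50\tPASS\t.\n",
]
cultivar2 = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr01\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr01\t300\t.\tG\tA\t50\tPASS\t.\n",
]

result = compare_sites(cultivar1, cultivar2)
for name in ("shared", "only_a", "only_b"):
    print(name, len(result[name]))
```

The three counts correspond to the overlap and the two outer regions of a two-set Venn diagram. Holding every key in memory is a simplification; with millions of SNPs per file it may need a fair amount of RAM.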
@Sheila, many thanks for the answer.