
How to compare two vcf files of two cultivars?

Hi,
I am looking for a program to compare the vcf files of two cultivars. I have three vcf files (vcf1, vcf2, vcf3) of SNPs from three different cultivars. The SNPs were called using GATK's UnifiedGenotyper. After stringent filtering with GATK's hard filters, I found approx. 6 million SNPs in these three cultivars. Now I would like to compare the SNP vcf files of any two cultivars (the combinations being vcf1/vcf2, vcf1/vcf3 and vcf2/vcf3) to identify common and specific SNPs, and I would also like to generate Venn diagrams for the combinations.

Can anyone suggest which tool I can use to identify common and specific SNPs between two cultivars? I appreciate your help.

Answers

  • shis USA Member
    edited January 2017

    @Sheila
    Thank you so much for the link. I combined the two vcf files and then extracted the intersection from the combined file using the following commands:

    Combine the data:

    java -Xmx2g -jar $GATK -T CombineVariants -R IRGSP.fasta -V cultivar1.vcf -V cultivar2.vcf -o union.vcf

    Select the intersection:

    java -Xmx2g -jar $GATK -T SelectVariants -R IRGSP.fasta -V union.vcf -select 'set == "Intersection";' -o intersect.vcf

    Now I would like to separate the SNPs that are not in the intersection. How can I get the SNPs that differ between the two cultivars?

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @shis
    Hi,

    Right now, you have all the sites where the two cultivars are variant. You can simply invert the selection criteria in SelectVariants using --invertselect. That will choose the sites not in set==intersection.

    -Sheila

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    You can also select individual sets by name to isolate them to separate files, if that is of interest. Check out the workshop tutorials for more details on how to do this.
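
A minimal sketch of selecting one set by name, assuming GATK 3.x and assuming the inputs are re-combined with explicit binding names. The labels cultivar1 and cultivar2, the file name union_named.vcf, and the two wrapper functions are illustrative choices, not taken from the thread. Without explicit names CombineVariants assigns its own labels, so it is worth checking the set= values in your own union file first.

```shell
# Re-combine with explicit names so the "set" annotation in INFO is predictable.
# The binding names (cultivar1, cultivar2) are hypothetical labels chosen here.
combine_named() {
    java -Xmx2g -jar "$GATK" -T CombineVariants -R IRGSP.fasta \
        -V:cultivar1 cultivar1.vcf -V:cultivar2 cultivar2.vcf \
        -o union_named.vcf
}

# Write the records whose set annotation equals the given name to a file.
# $1 = set name (e.g. cultivar1), $2 = output VCF
select_set() {
    java -Xmx2g -jar "$GATK" -T SelectVariants -R IRGSP.fasta \
        -V union_named.vcf -select "set == \"$1\";" -o "$2"
}
```

Running combine_named once, then `select_set cultivar1 cultivar1_only.vcf` and `select_set cultivar2 cultivar2_only.vcf`, would give one file per cultivar-specific set. Records that were filtered in one of the inputs carry other set values (for example ones containing filterIn), so these two selections would not capture them.
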
  • shis USA Member

    Hi Sheila,
    Following your suggestion, I used the following command to get the variant sites from the two cultivars that are not in 'set==intersection':
    $ java -Xmx16g -jar $GATK -T SelectVariants -R ./ref_genome/IRGSP.fasta -V cultivar1_cultivar2_union_final.vcf -select 'set == "Intersection";' -o cultivar1_cultivar2_variant_final.vcf -invertSelect

    Now my question is: are these SNPs between the two cultivars defined with respect to the reference genome, or independently of it?
    Thanks for the help.
    /Shis

  • Sheila Broad Institute Member, Broadie, Moderator admin
    edited January 2017

    @shis
    Hi,

    You now have the sites that are either variant in one set or variant in the other. Those sites are always called with respect to the reference, but the two cultivars have different genotypes at them. I hope this helps.

    -Sheila
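
For the Venn diagram counts asked about in the original question, one rough option is to tally the set annotation that CombineVariants writes into the INFO column of the union file. This is a plain awk sketch rather than a GATK tool. The count_sets function name is an assumption, and the sketch assumes an uncompressed VCF in which each record carries a set=... entry in column 8.

```shell
# Tally how many records carry each value of the "set" INFO annotation.
# $1 = uncompressed VCF produced by CombineVariants
count_sets() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($8, info, ";")       # column 8 is INFO
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^set=/)
                    count[substr(info[i], 5)]++
        }
        END { for (s in count) print s "\t" count[s] }
    ' "$1" | sort
}
```

Calling `count_sets union.vcf` prints one line per set value with its record count. The Intersection count corresponds to the overlap of the two circles, and the counts for the individual set names correspond to the cultivar-specific regions.
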

  • shis USA Member

    @Sheila, many thanks for the answer.
