Combine vcfs for multi-allelic sites

Kelly135 (Korea, Member)

Hi, I am trying to merge two VCFs (which were created using the same pipeline and references) using the CombineVariants tool.

The variant I am interested in is an indel and is multi-allelic. In one VCF file, this site has five different alleles (ref CAAA, alt C CA CAA CAAAA). In the second VCF file, it has three different alleles (ref CAA, alt C CA).

When these two files are merged, the site has five alleles. The problem is that samples from the second file have different genotypes compared to the original file. Their alleles were C, CA or CAA in the original file, but they were changed to CAA, CAAA and CAAAA.

When I loaded these two files separately with "variant tool", it seems that they carry CAA CAAA CAAAA, which means CombineVariants worked well.

I wonder why the original file shows different alleles.


Best Answers


  • Kelly135 (Korea, Member)
    edited July 2016

    Hi, I selected only a few samples from the original files and masked the position. The position is the same in all three files.
    I used the command below to combine the two files into one.

    java -jar -Xmx2g GenomeAnalysisTK.jar -T CombineVariants -R ucsc.hg19.fasta --variant A.vcf.gz --variant B.vcf.gz -o AB.vcf

    chr3 XXX . CAAAA CAA,CAAAAA,CAAA,CA,C 6275.66 PASS AC=27,19,48,7,5;AF=0.081,0.057,0.144,0.021,0.015;AN=334;BaseQRankSum=0.727;ClippingRankSum=0.322;DP=3376;ExcessHet=108.1939;FS=6.160;MLEAC=26,14,47,7,5;MLEAF=0.078,0.042,0.141,0.021,0.015;MQ=58.58;MQRankSum=0.358;PG=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QD=9.55;ReadPosRankSum=0.736;SOR=0.380;VQSLOD=0.238;culprit=QD GT:AD:DP:FT:GQ:PGT:PID:PL:PP 0/0:64,0,0,0,0,0:64:lowGQ:0:.:.:0,0,384,0,384,384,0,384,384,384,0,384,384,384,384,0,384,384,384,384,384:0,0,384,0,384,384,0,384,384,384,0,384,384,384,384,0,384,384,384,384,384 0/1:5,1,0,0,0,0:6:lowGQ:17:.:.:17,0,142,32,145,178,32,145,178,178,32,145,178,178,178,32,145,178,178,178,178:17,0,142,32,145,178,32,145,178,178,32,145,178,178,178,32,145,178,178,178,178 2/2:0,0,1,0,0,0:1:lowGQ:5:.:.:34,34,34,5,6,0,34,34,6,34,34,34,6,34,34,34,34,6,34,34,34:34,34,34,5,6,0,34,34,6,34,34,34,6,34,34,34,34,6,34,34,34

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2000017 2000020 2000030 2000072 2000101
    chr3 XXX . CA CAA,C,CAAA 24037.25 PASS AC=172,53,10;AF=0.222,0.068,0.013;AN=774;BaseQRankSum=0.197;ClippingRankSum=0.156;DP=5967;ExcessHet=83.4894;FS=1.104;MLEAC=176,48,7;MLEAF=0.227,0.062,9.044e-03;MQ=57.78;MQRankSum=0.067;PG=0,0,0,0,0,0,0,0,0,0;QD=10.51;ReadPosRankSum=0.067;SOR=0.803;VQSLOD=1.99;culprit=FS GT:AD:DP:FT:GQ:PL:PP 0/2:18,0,4,0:22:lowGQ:2:2,54,370,0,316,305,54,370,316,370:2,54,370,0,316,305,54,370,316,370 0/0:17,0,0,0:19:PASS:51:0,51,525,51,525,525,51,525,525,525:0,51,525,51,525,525,51,525,525,525 0/0:17,0,0,0:17:lowGQ:18:0,18,409,18,409,409,18,409,409,409:0,18,409,18,409,409,18,409,409,409 0/1:4,9,2,0:15:PASS:57:167,0,70,180,57,475,179,82,264,261:167,0,70,180,57,475,179,82,264,261 0/0:21,0,0,0:21:lowGQ:3:0,3,515,3,515,515,3,515,515,515:0,3,515,3,515,515,3,515,515,515

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3 2000017 2000020 2000030 2000072 2000101
    chr3 XXX . CAAAA CAA,CAAAAA,CAAA,CA,C,CAAAAAA 6275.66 PASS AC=27,191,101,7,5,10;AF=0.024,0.172,0.091,6.318e-03,4.513e-03,9.025e-03;AN=1108;DP=9343;set=Intersection GT:DP:FT:GQ:PGT:PID:PP 0/0:64:lowGQ:0:.:.:0,0,384,0,384,384,0,384,384,384,0,384,384,384,384,0,384,384,384,384,384 0/1:6:lowGQ:17:.:.:17,0,142,32,145,178,32,145,178,178,32,145,178,178,178,32,145,178,178,178,178 2/2:1:lowGQ:5:.:.:34,34,34,5,6,0,34,34,6,34,34,34,6,34,34,34,34,6,34,34,34 0/3:22:lowGQ:2:.:.:2,54,370,0,316,305,54,370,316,370 0/0:19:PASS:51:.:.:0,51,525,51,525,525,51,525,525,525 0/0:17:lowGQ:18:.:.:0,18,409,18,409,409,18,409,409,409 0/2:15:PASS:57:.:.:167,0,70,180,57,475,179,82,264,261 0/0:21:lowGQ:3:.:.:0,3,515,3,515,515,3,515,515,515

    Sample 2000017 had genotype 0/2 (that is, CA/C) in the original file, but its genotype became 0/3 (CAAAA/CAAA) in the combined file.

  • Kelly135 (Korea, Member)

    @Sheila Sorry, I selected your question as an answer by mistake, and I don't know how to undo it.

  • Kelly135 (Korea, Member)
    I see, but I still have a question.
    In my opinion, if the reference is CAAA and one sample has C/CA, then this person carries a three-base deletion and a two-base deletion. The idea that the CA/C and CAAAA/CAAA genotypes are equivalent, in that one A is deleted, comes from comparing the two alleles with each other (not with the reference allele). Sorry if I'm wrong. I'm so confused.

    In my case, one dataset is from cases and the other is from controls.
    When I look at the two datasets separately, most controls have shorter alleles (C CA CAA CAAA), and many cases have longer ones (C CA CAA CAAA CAAAA CAAAAA). So I thought that controls tend to have shorter variants. But after the two datasets were combined and the normalization happened, most controls had longer variants (CAAA CAAAA CAAAAA) while the cases remained unchanged. So it now seems that cases carry more deletions (shorter variants), which is the opposite of the conclusion I drew from the two separate files.
    In this case, is the latter conclusion, drawn from the combined data, the right one?

    Thanks a lot.
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @Kelly135 Variant representations at a site like this can be very confusing, so it can be helpful to draw all the alleles for each sample relative to the reference, to understand what the allele calls and genotype assignments represent. I strongly recommend you try doing this on a piece of paper or a whiteboard.
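
As a complement to the pen-and-paper exercise, here is a minimal Python sketch that re-expresses the alleles of the second file against the longer REF of the combined record and lists each allele's length change relative to its own REF. It is not GATK code: the helper names (`pad_to_ref`, `remap_genotype`, `indel_size`) are hypothetical, and it assumes the two REFs are reconciled by appending the missing reference bases to every allele, which is consistent with the records posted above. The allele strings are copied from those records.

```python
# Hypothetical helpers for illustration only; not part of GATK.

def pad_to_ref(ref, alts, merged_ref):
    """Re-express ALT alleles against a longer REF by appending the missing bases."""
    if not merged_ref.startswith(ref):
        raise ValueError("merged REF must start with the original REF")
    suffix = merged_ref[len(ref):]
    return [alt + suffix for alt in alts]


def remap_genotype(gt, ref, alts, merged_ref, merged_alts):
    """Translate allele indices such as '0/2' into the merged record's numbering."""
    suffix = merged_ref[len(ref):]
    old_alleles = [ref] + alts
    new_alleles = [merged_ref] + merged_alts
    return "/".join(
        str(new_alleles.index(old_alleles[int(i)] + suffix))
        for i in gt.split("/")
    )


def indel_size(ref, allele):
    """Length change relative to REF: negative = deletion, positive = insertion."""
    return len(allele) - len(ref)


# Alleles copied from the records in this thread.
B_REF, B_ALTS = "CA", ["CAA", "C", "CAAA"]
MERGED_REF = "CAAAA"
MERGED_ALTS = ["CAA", "CAAAAA", "CAAA", "CA", "C", "CAAAAAA"]

if __name__ == "__main__":
    padded = pad_to_ref(B_REF, B_ALTS, MERGED_REF)
    for before, after in zip([B_REF] + B_ALTS, [MERGED_REF] + padded):
        print(f"{before:>5} vs REF {B_REF}: {indel_size(B_REF, before):+d}   "
              f"{after:>8} vs REF {MERGED_REF}: {indel_size(MERGED_REF, after):+d}")
    # Sample 2000017 from the second file
    print(remap_genotype("0/2", B_REF, B_ALTS, MERGED_REF, MERGED_ALTS))
```

Under these assumptions, genotype 0/2 for sample 2000017 maps to 0/3, matching the combined record. The allele strings get longer, but the length change relative to REF is the same before and after: CA/C against REF CA and CAAAA/CAAA against REF CAAAA both describe one reference allele and one allele with a single A deleted. Comparing allele strings across records with different REFs is therefore misleading; comparing each allele with its own REF is not.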
