Mismatch between the number of variants in the input and output of GenotypeGVCFs

I am merging a few hundred samples into a project-level VCF. The following summarizes my steps:

a) ran CombineGVCFs on one set of gVCFs (pVCF1), and then CombineGVCFs on another set of gVCFs (pVCF2)
b) ran GenotypeGVCFs on pVCF1 and pVCF2
c) ran VQSR on the GenotypeGVCFs output.

What I found is that there are variants in the GenotypeGVCFs output that are not in pVCF1 or pVCF2, and they all pass the variant filters (VQSRTrancheSNP99.80to99.90 or VQSRTrancheSNP99.70to99.80). I am confused about why I am getting these results.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    Can you please post some example records?
  • chr22 11800146 . A G 26.52 . AC=4;AF=0.012;AN=336;BaseQRankSum=0.712;ClippingRankSum=0.00;DP=3174;ExcessHet=3.1446;FS=42.932;InbreedingCoeff=-0.0306;MLEAC=3;MLEAF=8.929e-03;MQ=54.86;MQRankSum=-1.465e+00;QD=0.37;ReadPosRankSum=-1.286e+00;SOR=5.549 GT:AD:DP:GQ:PL 0/0:45,0:45:99:0,114,1800 0/0:33,0:33:90:0,90,1350 0/0:30,0:30:90:0,90,993 ./.:0,0:0:.:0,0,0 0/0:18,0:18:54:0,54,626 ./.:0,0:0:.:0,0,0 0/0:3,0:3:0:0,0,35 0/0:31,0:31:90:0,90,1108 0/0:26,0:26:72:0,72,1080 0/0:38,0:38:99:0,103,1290 0/0:26,0:26:72:0,72,1080 0/0:2,0:2:0:0,0,10 0/1:7,2:9:29:29,0,196 0/0:19,0:19:22:0,22,591 0/0:21,0:21:39:0,39,585 0/0:17,0:17:37:0,37,522 ./.:0,0:0:.:0,0,0 0/0:19,0:19:54:0,54,810 0/0:33,0:33:93:0,93,1395 0/0:20,0:20:60:0,60,705 ./.:0,0:0:.:0,0,0 0/0:18,0:18:48:0,48,720 0/0:65,0:65:99:0,120,1800 0/0:15,0:15:39:0,39,580/0:11,0:11:33:0,33,362 0/0:30,0:30:55:0,55,833 0/0:21,0:21:60:0,60,690 0/0:11,0:11:33:0,33,362 0/0:15,0:15:45:0,45,471 0/0:17,0:17:16:0,16,451 0/0:2,0:2:6:0,6,70

  • Corresponding VQSR output:

    chr22 11800146 . A G 26.52 VQSRTrancheSNP99.80to99.90 AC=4;AF=0.012;AN=336;BaseQRankSum=0.712;ClippingRankSum=0.00;DP=3174;ExcessHet=3.1446;FS=42.932;InbreedingCoeff=-0.0306;MLEAC=3;MLEAF=8.929e-03;MQ=54.86;MQRankSum=-1.465e+00;QD=0.37;ReadPosRankSum=-1.286e+00;SOR=5.549;VQSLOD=-2.072e+01;culprit=SOR GT:AD:DP:GQ:PL 0/0:45,0:45:99:0,114,1800 0/0:33,0:33:90:0,90,1350 0/0:30,0:30:90:0,90,993 ./.:0,0:0:.:0,0,0 0/0:18,0:18:54:0,54,626 ./.:0,0:0:.:0,0,0 0/0:3,0:3:0:0,0,35 0/0:31,0:31:90:0,90,1108 0/0:26,0:26:72:0,72,1080 0/0:38,0:38:99:0,103,1290 0/0:26,0:26:72:0,72,1080 0/0:2,0:2:0:0,0,10 0/1:7,2:9:29:29,0,196 0/0:19,0:19:22:0,22,591 0/0:21,0:21:39:0,39,585 0/0:17,0:17:37:0,37,522 ./.:0,0:0:.:0,0,0 0/0:19,0:19:54:0,54,810 0/0:33,0:33:93:0,93,1395 0/0:20,0:20:60:0,60,705 ./.:0,0:0:.:0,0,0

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @yyee
    Hi,

    I am confused. Which records are from pVCF1 and pVCF2? Which records are from GenotypeGVCFs? It will be a lot easier for us if you can highlight the inconsistencies between the three outputs.

    Thanks,
    Sheila
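
One way to highlight the inconsistencies requested above is to list the sites that appear in the genotyped output but in neither combined gVCF. The following is a minimal Python sketch, not an official GATK utility: the function names and the file names in the usage comment are hypothetical. It assumes that matching on CHROM, POS, REF, and each concrete ALT allele is adequate, and it ignores the `<NON_REF>` placeholder that gVCF records carry.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def variant_sites(lines):
    """Collect (CHROM, POS, REF, ALT) keys, one per concrete ALT allele."""
    sites = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        # Splitting on any whitespace also tolerates space-separated
        # records pasted from a forum post; real VCFs are tab-delimited.
        chrom, pos, _id, ref, alt = line.split(None, 5)[:5]
        for allele in alt.split(","):
            if allele in (".", "<NON_REF>"):
                continue  # no concrete alternate allele here
            sites.add((chrom, int(pos), ref, allele))
    return sites


def new_in_output(output_lines, *input_line_sources):
    """Return sorted sites present in the output but in none of the inputs."""
    seen = set()
    for source in input_line_sources:
        seen |= variant_sites(source)
    return sorted(variant_sites(output_lines) - seen)


# Usage with hypothetical file names:
# with open_vcf("genotyped.vcf") as out, \
#         open_vcf("pVCF1.g.vcf") as a, open_vcf("pVCF2.g.vcf") as b:
#     for site in new_in_output(out, a, b):
#         print(*site, sep="\t")
```

A site reported by this sketch is not necessarily absent from the inputs. Exact allele matching can miss records whose REF/ALT representation differs between the gVCF and the genotyped output, and a position may lie inside a gVCF reference block (a record with an `END=` tag) without having a record of its own. Posting the gVCF records that cover each reported position alongside the genotyped record would make the comparison clear.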
