How do SelectVariants and VariantEval handle spanning deletion "*"?

Hi,

I have recently used GATK 3.5 to run joint calling (GenotypeGVCFs). After that, I used SelectVariants to select one sample and ran VariantEval with the sample's data. However, I am curious about how these two tools handle the spanning deletion alleles.

For example, this is the command I used for SelectVariants:
java -jar GenomeAnalysisTK.jar -R ${ref} -T SelectVariants -env -sn ${sample} -V ${joint_vcf} -trimAlternates -o ${output}

In the output vcf, I observed these records

1       10397   .       CCCCTAA C       712.91  PASS    AC=2;AF=1.00;AN=2;DP=4;ExcessHet=0.5902;FS=0.927;InbreedingCoeff=0.1184;MQ=6.13;MQ0=0;NEGATIVE_TRAIN_SITE;QD=16.20;SOR=0.560;VQSLOD=-1.122e+00;culprit=DP       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,4:4:12:1|1:10397_CCCCTAA_C:169,12,0  
1       10403   .       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  *       3432.77 VQSRTrancheINDEL99.00to100.00   AC=2;AF=1.00;AN=2;BaseQRankSum=-2.710e-01;ClippingRankSum=0.406;DP=4;ExcessHet=29.5046;FS=13.193;InbreedingCoeff=-0.1604;MQ=24.73;MQ0=0;MQRankSum=0.868;NEGATIVE_TRAIN_SITE;QD=11.11;ReadPosRankSum=-7.200e-01;SOR=1.681;VQSLOD=-1.715e+00;culprit=DP   GT:AD:DP:GQ:PGT:PID:PL  1/1:0,4:4:12:1|1:10397_CCCCTAA_C:169,12,0    
1       10583   .       G       A       5766.70 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-6.960e-01;ClippingRankSum=-2.620e-01;DP=19;ExcessHet=79.3063;FS=0.000;InbreedingCoeff=-0.5072;MQ=26.43;MQ0=0;MQRankSum=-1.390e-01;NEGATIVE_TRAIN_SITE;QD=5.64;ReadPosRankSum=-3.610e-01;SOR=1.792;VQSLOD=-2.703e+00;culprit=InbreedingCoeff    GT:AB:AD:DP:GQ:PL       0/1:0.790:15,4:19:75:75,0,429                                                
1       10616   .       CCGCCGTTGCAAAGGCGCGCCG  C       28746.84        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1238;MQ=23.66;MQ0=0;QD=32.93;SOR=9.720;VQSLOD=0.122;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0  
1       10619   .       CCGTTG  *       29067.03        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1081;MQ=53.67;MQ0=0;QD=30.83;SOR=9.738;VQSLOD=0.515;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0             
1       10622   .       T       *       29176.81        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.0811;MQ=46.61;MQ0=0;NEGATIVE_TRAIN_SITE;QD=29.67;SOR=9.747;VQSLOD=-2.216e+00;culprit=DP     GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0                         
1       10623   .       T       *       29176.81        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.0811;MQ=47.26;MQ0=0;NEGATIVE_TRAIN_SITE;QD=34.04;SOR=9.741;VQSLOD=-2.452e+00;culprit=DP     GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0                  
1       10626   .       AAAGGCGCGCCGCGCCG       *       29067.03        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1081;MQ=54.99;MQ0=0;QD=32.42;SOR=9.738;VQSLOD=0.117;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0   

There are several records with "*" as the only ALT allele. Since this is a single-sample VCF, I wonder whether this is the expected output. If we are only interested in this sample, the records with "*" as the only ALT seem redundant.
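To make "records with only '*' as ALT" concrete, here is a minimal Python sketch of the post-filter I have in mind. It is not a GATK feature; the function names `is_spanning_deletion_only` and `drop_spanning_deletion_only` are my own, and the sketch assumes a plain, tab-separated VCF read line by line.

```python
SPANNING_DELETION = "*"

def is_spanning_deletion_only(vcf_line):
    """Return True if every ALT allele of a VCF data line is '*'."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        # Not a complete VCF record (e.g. a blank line); leave it alone.
        return False
    alts = fields[4].split(",")
    return all(alt == SPANNING_DELETION for alt in alts)

def drop_spanning_deletion_only(lines):
    """Yield header lines and records with at least one non-'*' ALT allele."""
    for line in lines:
        if line.startswith("#") or not is_spanning_deletion_only(line):
            yield line
```

Something like `sys.stdout.writelines(drop_spanning_deletion_only(sys.stdin))` would then filter a VCF piped on standard input. Records that mix "*" with another ALT allele are kept unchanged, so this only removes the sites I would consider redundant for a single sample.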

Running VariantEval on the above sites gave these counts:

VariantSummary  CompRod  EvalRod  JexlExpression  Novelty  nSamples  nProcessedLoci  nSNPs  TiTvRatio  SNPNoveltyRate  nSNPsPerSample  TiTvRatioPerSample  SNPDPPerSample  nIndels  IndelNoveltyRate  nIndelsPerSample  IndelDPPerSample  nSVs  SVNoveltyRate  nSVsPerSample
VariantSummary  dbsnp    sample   none            all      1         72              3      0.50       66.67           3               0.50                3.0             4        75.00             4                 4.0               0     NA             0

VariantEval counted 3 SNPs and 4 indels. This seems to suggest that the "*" records are also counted. Is this intended? It looks like over-counting to me. Am I interpreting the results incorrectly?
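As a sanity check, here is a small Python sketch of one naive model that would give these counts. It is only a guess, not VariantEval's actual logic (I have not checked the GATK source): it assumes filtered records are skipped and that "*" is treated like any other one-character allele, so a record is a "SNP" when REF and ALT are both one character long and an "INDEL" otherwise. The names `naive_variant_type` and `tally` are hypothetical.

```python
from collections import Counter

def naive_variant_type(ref, alt):
    """Classify by allele string length only; '*' counts as a 1-character allele.

    Simplification: equal-length multi-base substitutions are lumped into INDEL.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    return "INDEL"

def tally(records):
    """Count types over (ref, alt, filter) tuples, skipping filtered records."""
    counts = Counter()
    for ref, alt, flt in records:
        if flt not in ("PASS", "."):
            continue  # assumption: filtered sites are not counted
        counts[naive_variant_type(ref, alt)] += 1
        if alt == "*":
            counts["star_only"] += 1
    return counts

# (REF, ALT, FILTER) copied from the eight records shown above
RECORDS = [
    ("CCCCTAA", "C", "PASS"),
    ("ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC", "*", "VQSRTrancheINDEL99.00to100.00"),
    ("G", "A", "PASS"),
    ("CCGCCGTTGCAAAGGCGCGCCG", "C", "PASS"),
    ("CCGTTG", "*", "PASS"),
    ("T", "*", "PASS"),
    ("T", "*", "PASS"),
    ("AAAGGCGCGCCGCGCCG", "*", "PASS"),
]

counts = tally(RECORDS)
print(counts["SNP"], counts["INDEL"], counts["star_only"])  # prints: 3 4 4
```

Under this model, four of the seven counted variants are "*"-only records, which is why the totals look over-counted to me. The model matching the numbers does not prove that VariantEval works this way.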

Also, I am not sure how "*" was interpreted for TiTv:

TiTvVariantEvaluator  CompRod  EvalRod  JexlExpression  Novelty  nTi  nTv  tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard  nTiDerived  nTvDerived  tiTvDerivedRatio
TiTvVariantEvaluator  dbsnp    sample   none            all      1    2    0.50       2          1          2.00               0           0           0.00
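One guess at where nTi=1 and nTv=2 come from, written as a short Python sketch. This is an assumption on my part, not the evaluator's confirmed behavior: if every single-base record that is not a recognised transition falls through to "transversion", then the two unfiltered T to "*" records (10622 and 10623) would each count as a transversion, and G to A (10583) would be the only transition. The helper `naive_ti_tv` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def naive_ti_tv(ref, alt):
    """Return 'Ti' for purine<->purine or pyrimidine<->pyrimidine changes, else 'Tv'.

    Anything that is not a recognised transition, including '*', falls
    through to 'Tv' in this naive model.
    """
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "Ti"
    return "Tv"

# REF/ALT of the three unfiltered single-base records (10583, 10622, 10623)
passing_single_base = [("G", "A"), ("T", "*"), ("T", "*")]
labels = [naive_ti_tv(ref, alt) for ref, alt in passing_single_base]
print(labels.count("Ti"), labels.count("Tv"))  # prints: 1 2
```

If that is what happens, the reported tiTvRatio of 0.50 would be driven by the "*" records rather than by real substitutions.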

I appreciate any insights into this. Thanks!
