
How do SelectVariants and VariantEval handle the spanning deletion allele "*"?

Hi,

I have recently used GATK 3.5 to run joint calling (GenotypeGVCFs). After that, I used SelectVariants to select one sample and ran VariantEval with the sample's data. However, I am curious about how these two tools handle the spanning deletion alleles.

For example, this is the command I used for SelectVariants:
java -jar GenomeAnalysisTK.jar -R ${ref} -T SelectVariants -env -sn ${sample} -V ${joint_vcf} -trimAlternates -o ${output}

In the output VCF, I observed these records:

1       10397   .       CCCCTAA C       712.91  PASS    AC=2;AF=1.00;AN=2;DP=4;ExcessHet=0.5902;FS=0.927;InbreedingCoeff=0.1184;MQ=6.13;MQ0=0;NEGATIVE_TRAIN_SITE;QD=16.20;SOR=0.560;VQSLOD=-1.122e+00;culprit=DP       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,4:4:12:1|1:10397_CCCCTAA_C:169,12,0  
1       10403   .       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  *       3432.77 VQSRTrancheINDEL99.00to100.00   AC=2;AF=1.00;AN=2;BaseQRankSum=-2.710e-01;ClippingRankSum=0.406;DP=4;ExcessHet=29.5046;FS=13.193;InbreedingCoeff=-0.1604;MQ=24.73;MQ0=0;MQRankSum=0.868;NEGATIVE_TRAIN_SITE;QD=11.11;ReadPosRankSum=-7.200e-01;SOR=1.681;VQSLOD=-1.715e+00;culprit=DP   GT:AD:DP:GQ:PGT:PID:PL  1/1:0,4:4:12:1|1:10397_CCCCTAA_C:169,12,0    
1       10583   .       G       A       5766.70 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-6.960e-01;ClippingRankSum=-2.620e-01;DP=19;ExcessHet=79.3063;FS=0.000;InbreedingCoeff=-0.5072;MQ=26.43;MQ0=0;MQRankSum=-1.390e-01;NEGATIVE_TRAIN_SITE;QD=5.64;ReadPosRankSum=-3.610e-01;SOR=1.792;VQSLOD=-2.703e+00;culprit=InbreedingCoeff    GT:AB:AD:DP:GQ:PL       0/1:0.790:15,4:19:75:75,0,429                                                
1       10616   .       CCGCCGTTGCAAAGGCGCGCCG  C       28746.84        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1238;MQ=23.66;MQ0=0;QD=32.93;SOR=9.720;VQSLOD=0.122;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0  
1       10619   .       CCGTTG  *       29067.03        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1081;MQ=53.67;MQ0=0;QD=30.83;SOR=9.738;VQSLOD=0.515;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0             
1       10622   .       T       *       29176.81        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.0811;MQ=46.61;MQ0=0;NEGATIVE_TRAIN_SITE;QD=29.67;SOR=9.747;VQSLOD=-2.216e+00;culprit=DP     GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0                         
1       10623   .       T       *       29176.81        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.0811;MQ=47.26;MQ0=0;NEGATIVE_TRAIN_SITE;QD=34.04;SOR=9.741;VQSLOD=-2.452e+00;culprit=DP     GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0                  
1       10626   .       AAAGGCGCGCCGCGCCG       *       29067.03        PASS    AC=2;AF=1.00;AN=2;DP=10;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.1081;MQ=54.99;MQ0=0;QD=32.42;SOR=9.738;VQSLOD=0.117;culprit=FS      GT:AD:DP:GQ:PL  1/1:0,10:10:31:451,31,0   

There are several records with "*" as the only ALT allele. Since this is a single-sample VCF, I wonder whether this is the expected output. If we are only interested in this sample, the records with "*" as the only ALT seem redundant.
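To make concrete which records I mean, here is a minimal Python sketch that flags records whose ALT column contains nothing but "*". It is not part of GATK: the function name `is_star_only` is made up for illustration, and it assumes whitespace-separated VCF data lines (shortened here to the first seven columns).

```python
def is_star_only(vcf_line):
    """Return True if every ALT allele of the record is the spanning deletion '*'."""
    fields = vcf_line.split()
    alts = fields[4].split(",")  # column 5 of a VCF data line is ALT
    return all(alt == "*" for alt in alts)

# Three of the records above, truncated after the FILTER column
records = [
    "1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 28746.84 PASS",
    "1 10619 . CCGTTG * 29067.03 PASS",
    "1 10622 . T * 29176.81 PASS",
]

# Positions of the records that carry only '*' as ALT
star_only = [line.split()[1] for line in records if is_star_only(line)]
print(star_only)
```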

Running VariantEval on the above sites gave these counts:

VariantSummary  CompRod  EvalRod  JexlExpression  Novelty  nSamples  nProcessedLoci  nSNPs  TiTvRatio  SNPNoveltyRate  nSNPsPerSample  TiTvRatioPerSample  SNPDPPerSample  nIndels  IndelNoveltyRate  nIndelsPerSample  IndelDPPerSample  nSVs  SVNoveltyRate  nSVsPerSample
VariantSummary  dbsnp    sample   none            all      1         72              3      0.50       66.67           3               0.50                3.0             4        75.00             4                 4.0               0     NA             0

VariantEval counted 3 SNPs and 4 indels. This seems to suggest that the "*" records are also counted. Is this intended? It looks like over-counting to me. Am I interpreting the results incorrectly?
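To check whether these counts are consistent with the "*" records being counted, here is a rough Python tally over the eight records above. The classification rule is only my guess, not VariantEval's documented behavior: skip records whose FILTER is not PASS, call a record a SNP when REF and ALT are both one character long, and call it an indel otherwise. The function name `tally` and the `count_star` switch are hypothetical.

```python
# (POS, REF, ALT, FILTER) copied from the records above
records = [
    (10397, "CCCCTAA", "C", "PASS"),
    (10403, "ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC", "*", "VQSRTrancheINDEL99.00to100.00"),
    (10583, "G", "A", "PASS"),
    (10616, "CCGCCGTTGCAAAGGCGCGCCG", "C", "PASS"),
    (10619, "CCGTTG", "*", "PASS"),
    (10622, "T", "*", "PASS"),
    (10623, "T", "*", "PASS"),
    (10626, "AAAGGCGCGCCGCGCCG", "*", "PASS"),
]

def tally(records, count_star):
    """Count (SNPs, indels) under the guessed rule; optionally skip '*'-only records."""
    n_snps = 0
    n_indels = 0
    for pos, ref, alt, flt in records:
        if flt != "PASS":
            continue  # assumption: filtered records are not counted
        if alt == "*" and not count_star:
            continue
        if len(ref) == 1 and len(alt) == 1:
            n_snps += 1  # assumption: single-character REF and ALT counts as a SNP
        else:
            n_indels += 1
    return n_snps, n_indels

with_star = tally(records, count_star=True)
without_star = tally(records, count_star=False)
print(with_star, without_star)
```

Under this guess, counting the "*" records gives 3 SNPs and 4 indels, which matches the report, while skipping them gives 1 SNP and 2 indels.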

Also, I am not sure how "*" was interpreted for TiTv:

TiTvVariantEvaluator  CompRod  EvalRod  JexlExpression  Novelty  nTi  nTv  tiTvRatio  nTiInComp  nTvInComp  TiTvRatioStandard  nTiDerived  nTvDerived  tiTvDerivedRatio
TiTvVariantEvaluator  dbsnp    sample   none            all      1    2    0.50       2          1          2.00             0           0           0.00
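As a sanity check on the Ti/Tv numbers, here is a small Python sketch under one assumption of mine (not something I have confirmed in the GATK code): every single-character REF/ALT pair that is not a transition is counted as a transversion, including T to "*".

```python
# The four transitions: purine <-> purine and pyrimidine <-> pyrimidine
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# (REF, ALT) of the PASS records above with single-character REF and ALT:
# positions 10583, 10622 and 10623
single_base = [("G", "A"), ("T", "*"), ("T", "*")]

n_ti = sum(1 for pair in single_base if pair in TRANSITIONS)
n_tv = len(single_base) - n_ti  # assumption: anything else is a transversion
ratio = n_ti / n_tv
print(n_ti, n_tv, ratio)
```

This gives nTi = 1, nTv = 2 and a ratio of 0.50, the same values as in the report, so it looks as if the two T to "*" records may have been counted as transversions.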

I appreciate any insights into this. Thanks!
