The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# version 2.7 - CombineVariants seems to jave a bug in reporting AD and PL fields?

Posts: 23Member

hello,

I am using CombineVariants to combine two multisample vcfs from made from HaplotypeCaller. The vcfs cover the same genomic region, and do not overlap in sample names. I have noticed some potential issues reporting AD and PL fields after running CombineVariants.

If the site was multiallelic in one vcf but biallelic in the other vcf, then the AD and PL fields are omitted altogether, such that GT:DP:GQ is what is reported in the combined file - I am guessing this is expected behavior?

however, if the site was biallelic in both vcfs, or monomorphic on one vcf but polymorphic in the other, the AD and PL fields are reported, but the numbers do not make sense.

my command line:

GATK -T CombineVariants -R ~/Capsella_rubella_v1.0_combined.fasta -V sc1_HC_94samp.vcf -V sc1_HC_set2.vcf -nt 20 -o sc1_HC.vcf &

original vcf record:

scaffold_1 3275 . G T 1170.18 . AC=2;AF=0.011;AN=188;BaseQRankSum=0.295;ClippingRankSum=-1.233;DP=5815;FS=1.186;InbreedingCoeff=-0.0106;MLEAC=1;MLEAF=5.319e-03;MQ=98.00;MQ0=0;MQRankSum=3.414;QD=9.36;ReadPosRankSum=1.761
GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,293,4371 0/0:56,0:56:99:0,218,2973 0/0:40,0:40:99:0,161,2615 0/0:65,0:65:99:0,239,3898 0/0:62,0:62:99:0,227,3612

combined vcf record:

scaffold_1 3275 . G T 1170.18 . AC=3;AF=7.979e-03;AN=376;DP=11787;MLEAC=1;MLEAF=5.319e-03;MQ0=0;set=Intersection
GT:AD:DP:GQ:PL 0/0:0,0:59:99:0,92,0 0/0:0,42:56:99:3,5,2805 0/0:675,0:40:99:29,26,8482 0/0:45,0:65:99:43,6,11447 0/0:0,187:62:99:4079,0,0 0/0:55,0:27:99:0,182,941 0/0:0,251:69:99:0,162,2300

is this a bug? or am I missing something about how to use CombineVariants? thanks much for your help!

YW

Tagged:

• Posts: 53Member ✭✭

Hi there, I'm also joining this conversation because I noticed a behaviour in CombineVariants which I don't quite understand.

In some cases the genotype format fields seem to be combined without any omitted information, while in others - in the very same file - important fields like PL are omitted.

an example in my first set

2   91738754    rs201998549 T   C   14612.57    PASS    AC=184;AF=0.343;AN=536;BaseQRankSum=27.492;DB;DP=11129;Dels=0.00;FS=866.716;HaplotypeScore=13.1913;InbreedingCoeff=-0.5160;MLEAC=183;MLEAF=0.341;MQ=32.95;MQ0=0;MQRankSum=-11.074;QD=1.86;ReadPosRankSum=6.861;VQSLOD=-4.399e+22;culprit=DP GT:AD:DP:GQ:PL


in my second set:

2   91738754    rs201998549 T   C,TTCATC    78610.04    VQSRTrancheINDEL99.90to100.00   AC=15,253;AF=0.028,0.472;AN=536;BaseQRankSum=8.129;DB;DP=8173;FS=3200.000;InbreedingCoeff=-0.8952;MLEAC=11,255;MLEAF=0.021,0.476;MQ=37.75;MQ0=0;MQRankSum=-54.591;QD=2.05;ReadPosRankSum=4.877;VQSLOD=-1.975e+02;culprit=FS GT:AD:GQ:PL


and in the merged file

2   91738754    rs201998549 T   C,TTCATC    14612.57    PASS    AC=184,0;AF=0.343,0.00;AN=536;DB;DP=30431;Dels=0.00;HaplotypeScore=13.1913;MQ0=0;RPA=2,3;RU=TCATC;STR;set=oldCalls-filterInoldCalls-filterInnewCalls    GT:DP:GQ


in other cases in the very same file instead the information is not omitted and I have: first set:

  1 15190   rs200030104 G   A   1418.27 PASS    AC=106;AF=0.546;AN=194;BaseQRankSum=1.954;DB;DP=3223;Dels=0.00;FS=10.802;HaplotypeScore=0.2056;InbreedingCoeff=0.2387;MLEAC=105;MLEAF=0.541;MQ=7.49;MQ0=1;MQRankSum=-0.312;QD=1.77;ReadPosRankSum=-0.377;VQSLOD=-6.687e+20;culprit=QD   GT:AD:DP:GQ:PL


second set:

 1  15190   rs200030104 G   A   1598.41 PASS    AC=105;AF=0.547;AN=192;BaseQRankSum=1.932;DB;DP=110;FS=3.903;InbreedingCoeff=0.3270;MLEAC=106;MLEAF=0.552;MQ=24.58;MQ0=0;MQRankSum=-0.717;QD=24.98;ReadPosRankSum=-0.466;VQSLOD=45.70;culprit=DP    GT:AD:GQ:PL


and merged record:

1   15190   rs200030104 G   A   1418.27 PASS    AC=106;AF=0.546;AN=194;DB;DP=3333;Dels=0.00;HaplotypeScore=0.2056;set=Intersection  GT:AD:DP:GQ:PL


I understand that for some annotations in the INFO field I might need to re-run the VariantAnnotator on the merged file, that's understandable. But how about the GQ and PL fields? And how should I handle a file where for some variants that information is retained and for others is not?

I also didn't understand much the set set=oldCalls-filterInoldCalls-filterInnewCalls because in my "oldCalls" the variant is PASS and in the "newCalls" is in a tranche, I would have expected a set like set=oldCalls-filterInnewCalls.

thanks, Francesco

• Posts: 53Member ✭✭

Hi @Geraldine_VdAuwera, no multithreading here. the exact command I used is:

    java -Djava.io.tmpdir=/project/variantCalling/tmp -Xmx8g -Xms6g \
-jar /com/extra/GATK/2.7-2/jar-bin/GenomeAnalysisTK.jar \
-T CombineVariants \
-R /home/lescai/resources/GATKbundle/2.5/b37/human_g1k_v37.fasta \
--variant:oldCalls oldCalls/WGS.recalibrated.vcf \
--variant:newCalls newCalls/WGS.recalibrated.vcf \
-o  WGS.recalibrated.merged.vcf \
-genotypeMergeOptions UNSORTED


and version of GATK is therefore 2.7-2-g6bda569 thanks!

• Posts: 23Member

hi Geraldine,

I had previously used nt - I just tried combining the same file singlethreaded, and the problem with AD and PL disappears! For biallelic sites it is now reporting the AD and PL values from the original files. thanks!

and thank you Eric for the clarification re multi allelic sites.

YW

• Posts: 53Member ✭✭

Hi @ebanks, thanks for your reply. it perfectly makes sense when the alleles (and consequently the genotypes maybe) are not the same.

Since you might not want to loose information, however, is there a way to output the two calls from different files as different variants, when they have the same position but the genotypes are discordant? I'm not sure if this would be compatible with VCF format though.

Or possibly print on a separate file the two calls when they are discordant. I just would like to handle my data in a merged file, but of course be aware when they're not the same and not loose information. PL field is particularly important when doing association analysis of course, so would prefer not to loose it :-)

cheers, Francesco

• Posts: 679GATK Developer mod

Hi Francesco,

I've updated SelectVariants so that it no longer strips out the PLs or AD (instead it fixes them), but this isn't currently possible with CombineVariants. I understand your desire to use the PLs, but the problem is that they are no longer accurate in the context of new alternate alleles. Perhaps one of your suggested solutions could work, but we don't have the resources to implement them right now (however, we are always willing to accept contributions from our user community!). In the meantime, you should consider calling all of your samples jointly if possible - then you won't have to combine them ad hoc at the end. We are also working on another approach using the reference calculation model in the Haplotype Caller that will allow users to do single sample "calling" with a smart joint join step at the end. Stay tuned!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 20GATK Developer mod

Hi @leeyoungwha,

I'm trying to reproduce the issue with -nt, but in our test cases I'm not having luck recreating different outputs with and without -nt.

If you and @flescai can share your input vcfs that produce errors, along with log files produced with -log log_file.txt, that would help us immensely in terms of narrowing down the issue. After you upload the files, if you could email kshakir [at] broadinstitute.org with the names of files to look for, I'll go through the log files to find and run the commands on our servers and get back to you asap.

See this post for how to upload data to our write-only ftp directory: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

Thanks!

• Posts: 53Member ✭✭

Hi @ebanks thanks for that. I was also thinking that maybe, in the case of multiple alleles and discordant genotypes for the same samples, one might prefer to give priority to one of the call sets. The current behaviour of the genotypeMerge PRIORITY option is to select the genotypes based on the priority but merge the ALT alleles (if I understand correctly the output in my files). What about simply deciding which of the call sets to retain completely? In this case you could also keep the PL and AD fields, because they will come from a specific call/caller.

As a concrete example in my case I have two call sets performed on the same samples by UnifiedGenotyper first and HaplotypeCaller after. We might discuss if that's appropriate or not, but I would like to keep the data form both and - in case of discordant alleles or genotypes - give preference to HaplotypeCaller. How would you suggest to do that?

and @kshakir thanks for looking into the specific. I need to relaunch the command, because I didn't dump the log file. as soon as it's done I'll upload the files. thanks.

• Posts: 679GATK Developer mod

That's a very good suggestion that works well for the case where all input VCFs have the same set of samples - and I think you can do that already without having to change the code at all. It would require 2 steps: 1. Run -T SelectVariants -V UG.calls.vcf -XL HC.calls.vcf -o UG.unique.calls.vcf (this gives you calls found only in the UG set) 2. Run -T CombineVariants -V HC.calls.vcf -V UG.unique.calls.vcf -o both.calls.vcf (so you get all of the HC calls and genotypes plus those from UG that aren't in the HC set) Does that basically do what you described above?

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 53Member ✭✭

thanks @ebanks ! I think that responds to what I'd like to do. I'll follow your suggestion, and report here if there's anything I notice in the outputs. thanks again!