Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CombineVariants in GATK4

Is it planned to add CombineVariants tool into GATK4.0 toolkit (it existed in previous GATK versions)? The only similar tool currently available in GATK4.0 Beta is GatherVCFs which has very limited possibility and cannot concatenate unsorted VCFs or merge different INFO fields correctly.
Thanks! :)

Tagged:

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Vladimir_Kovacevic
    Hi.

    There is a tool called MergeVcfs which you can use instead of CombineVariants. It looks like there is no documentation for it yet, but if you use --list with gatk-launch, it lists the available tools. We should have better documentation within the coming months when GATK4 is out of beta.

    -Sheila

  • Hi @Sheila!
    Thank you for your response and suggestion. We tried MergeVcfs and unfortunately it failed with two VCFs that pass with CombineVariants. Here is the error:
    Using GATK jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf
    12:00:35.965 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
    [September 19, 2017 12:00:35 PM CEST] MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 1 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX true --CREATE_MD5_FILE false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [September 19, 2017 12:00:35 PM CEST] Executing as [email protected] on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.2
    12:00:41.015 INFO MergeVcfs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:00:41.016 INFO MergeVcfs - Deflater: IntelDeflater
    12:00:41.016 INFO MergeVcfs - Inflater: IntelInflater
    12:00:41.016 INFO MergeVcfs - Initializing engine
    12:00:41.016 INFO MergeVcfs - Done initializing engine
    12:00:41.573 INFO MergeVcfs - Processed 10,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 3:189,995,416
    12:00:41.785 INFO MergeVcfs - Processed 20,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 8:132,077,728
    12:00:41.957 INFO MergeVcfs - Shutting down engine
    [September 19, 2017 12:00:41 PM CEST] org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs done. Elapsed time: 0.10 minutes.
    Runtime.totalMemory()=352845824
    java.lang.IllegalStateException: The elements of the input Iterators are not sorted according to the comparator htsjdk.variant.variantcontext.VariantContextComparator
    at htsjdk.samtools.util.MergingIterator.next(MergingIterator.java:113)
    at org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs.doWork(MergeVcfs.java:126)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:62)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:230)

    Do you have any more suggestions?

    FYI @teodora_aleksic

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Vladimir_Kovacevic
    Hi,

    Hmm. Can you confirm the VCFs pass ValidateVariants?

    Also, can you try deleting the VCF indices and re-generating them?

    Thanks,
    Sheila

  • said3427said3427 MexicoMember
    edited September 2017

    I am moving to GATK4 and had the same question. It worked for me :smile:

    Thank you,
    Said MM

  • mjtivmjtiv Newark, DEMember
    edited April 2018

    It appears the MergeVcfs is built on top of Picard (GATK 4.0 does mentions this too). So, if you run into issues with this command go to Picard and look at what they recommend to do (files must be sorted the same, output file has a file type designated (vcf etc). Just skimming the error message above I think the error is caused by the files not being sorted the same.

    Here is a similar command using straight Picard

    java -jar picard MergeVcfs \
    I=sample_8_filtered_raw_SNPs.vcf \
    I=filtered_sample_8_raw_indels.vcf \
    O=combined_Filtered_Variants_4-4-2018.vcf

  • hdeteringhdetering Vigo, SpainMember

    While Picard's MergeVcfs can be used to combine variants from VCFs containing the same samples, this covers only one mode (UNION) of two (MERGE, UNION) that CombineVariants offered.

    The MERGE mode is more akin to bcftools merge, combining info about variant loci from multiple samples, each contained in their individual VCFs.

    Is there a tool in GATK4 that covers the MERGE mode? (I would like to use it to combine results from MuTect for multiple samples.)

    Cheers,

    • Harry
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @hdetering
    Hi Harry,

    You should be able to use CombineVariants to merge VCFs from different samples. Is there a reason you are against using it?

    -Sheila

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Is there a tool in GATK4 that covers the MERGE mode? (I would like to use it to combine results from MuTect for multiple samples.)

    @hdetering Apologies for being nosy but implementing a multi-sample mode or workflow is one of our top priorities for Mutect2 in the coming months. What features would this have to have to suit your needs?

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited August 2018

    Is there a specific reason why CombineVariants is not in GATK 4? I am also interested in a CombineVariants port to GATK 4 (eg. for combining variants of the same sample from different callers for side-by-side comparison of variants), in addition to a multi-sample workflow for Mutect2 (for genotyping the same variant across multiple samples)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @FPBarthel
    Hi,

    There is MergeVCFs for your first case. For the second (combining different sample VCFs), I think the team discourages this because we have the GVCF workflow and don't recommend merging single sample VCFs.

    -Sheila

  • FPBarthelFPBarthel HoustonMember ✭✭

    Hi @Sheila,

    Admittedly I have not tested this but from the documentation MergeVCFs does not have a feature that allows for variant intersection? Eg. supply MergeVCFs with three input VCF files and have it output only variants present in at least 2/3 input files? Something like this? That would be what I am looking for for the first case.

    The second case would be to take the union of multiple VCF files from different samples, discarding the FORMAT columns, followed by a second step of genotyping the variant set across multiple samples, thus reintroducing the FORMAT column for multiple samples.

    Floris

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @FPBarthel
    Hi Floris,

    Eg. supply MergeVCFs with three input VCF files and have it output only variants present in at least 2/3 input files?

    In this case, you would have to first merge the VCFs, then use SelectVariants to select the sites of interest.

    -Sheila

  • Hi @Sheila,

    I am a bit confused, after following the GATK Best Practices and reaching till here https://software.broadinstitute.org/gatk/documentation/article.php?id=2806, obtaining the INDEL.vcf and SNP.vcf from a sample, how should we continue, analyzing each separately or could we combine them using MergeVCFs?

    My intentions are annotations and GenotypePosteriori and VariantsToTable.
    PS: (I am using WES).

    Best,
    Marius.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    If you want to proceed with CalculateGenotypePosteriors you would want to combine the SNP and indel vcfs with MergeVcfs. If you're using VariantsToTable it depends on whether having separate tables for SNPs and indels is more convenient.

  • Thanks, @davidben it worked with MergeVcfs perfectly :smiley:

  • kmmahankmmahan Member

    Why do we need to merge vcfs? Do we have to or can we keep them separate?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @kmmahan At this point we're talking about analyses downstream of the GATK. Sometimes it will be convenient to keep SNPs and indels together; sometimes it won't. I will say, however, that if the variants do somehow end up as the input to a GATK tool, it would most likely expect a single merged vcf. That's just the usual GATK style.

  • nketchinketchi Member

    Hi everyone,

    I am faced with a similar problem. Having gotten to the point where I am happy with my vcf file filtering, I would like to use CombineVariants to select only SNPs which occur across all of my populations. Before with gatk3, we did it like this:

    gatk --java-options "-Xmx10g" -l INFO -T CombineVariants -R $reference \
    --variant $maxmiss_filtered.pop1.recode.vcf \
    --variant $maxmiss_filtered.pop2.recode.vcf \
    --variant $maxmiss_filtered.pop3.recode.vcf \
    --variant $maxmiss_filtered.pop4.recode.vcf \
    --variant $maxmiss_filtered.pop5.recode.vcf \
    --variant $maxmiss_filtered.pop6.recode.vcf \
    --minimumN 6 -o $maxmiss_filtered.merged.vcf

    Can someone please tell me which steps I would have to take using mergevcf and selectvariants to achieve a similar result? (using gatk4)

    I need a merged vcf across my populations for all downstream analyses!

    I am still wondering why this tool was removed? (it was so useful!)

    Thanks,

    Josie

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @nketchi

    1) We are discussing with the developers why CombineVariants has not been ported to GATK4 and I will get back to you with an answer.

    2)If it helps your case, you can continue using CombineVariants from GATK3 in the mean while.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    Accepted Answer

    Hi @nketchi

    The developers got back to me and mentioned that porting CombineVariants to GTAK4 is a work in progress. I have let them know that this is a feature that the users wants and hence will be prioritized.
    For now you can use it from GATK3.
    I hope this helps.

    Regards
    Bhanu

  • Vladimir_KovacevicVladimir_Kovacevic Member
    Accepted Answer

    Thank you very much, @bhanuGandham !

  • Hopefully this is done fast... I would like to use CombineVariants in GATK4 as well.

  • aschoenraschoenr Member

    Just thought I'd add my vote to bring CombineVariants into GATK4. Will be using GATK3 for this alone until then.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @aschoenr @Vladimir_Kovacevic @nketchi @hdetering @Rosmaninho @FPBarthel

    Thank you all for bringing this to our notice, I am trying to see if our team can either port CombineVariants or have those features included in MergeVcfs in GATK4.

    What would be very helpful is you could tell me some specific features that are useful to you from CombineVariants that are not possible with MergeVcfs ? This information will be very helpful. Thank you in advance.

    Regards
    Bhanu

  • Union of variants from two or more somatic callers (Tumor-Normal VCF), where tumor normal pair are from the same person. If I were you I would would test different set of mostly used, for example Strelka, Vardict, Mutect2, Varscan, Somatic Sniper, Rufus etc.

  • I did not even use Mergevcfs because I I didn't want to Mergevcfs.
    I used CombineVcfs because of the -minN option where it only outputs sites observed in a minimum number of samples from a certain number of samples.

    For example, I have a set of 10 samples with a phenotype in common and I want a vcf file with the sites that are present in a minimum of 5 of those 10 samples.
    This way I got a vcf with variants that are present in a significant number of my samples and might be significant but I am not being extremely restrictive and asking for variants present in all my samples.

  • @Rosmaninho, I did not even know about this feature of CombineVariants. Well, what can I say except, it is one hell of a tool!

  • Well, I was looking specifically for a tool to do this so i found it in CombineVariants...
    If you were looking for this I'm sure you would find about it as well. :P

  • jjmmiijjmmii Member

    I need CombineVariants to merge two VCFs with priority. Seems this can't be done in MergeVcfs, so I'm adding my vote for GATK4 CombineVariants as well. Thanks so much to the developers.

  • aschoenraschoenr Member

    Hi @bhanuGandham, sorry, I didn't see this until now. I needed to use CombineVariants in my implementation of the Germline short variant discovery workflow. Specifically, I implemented the joint genotyping in parallel, processing each chromosome individually. Since I needed a single VCF for VQSR, I needed to combine all of the VCFs I produced. I tried another method and I ended up with far too many entries (columns) in the resulting VCF. Since the samples were the same (and in the same order) in all of my VCFs, I think I just needed to basically concatenate the files together while maintaining the header, and CombineVariants was the tool I found mentioned on the GATK forums to do this. Please let me know if I can achieve this with another tool in GATK4. Like I said, this is probably the simplest case in which VCFs need to be combined.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Thanks for all the feedback!

  • xiuczxiucz Member

    @bhanuGandham

    I need it also, for example, if I want to combine vcf from freebayes and vcf from vqsr or vcf from streakle2 together, CombineVariants tool will be helpful.

    Thank you.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @xiucz We are tracking this issue, and it would be helpful to find out which feature of CombineVariants is needed. In other words, what do you need that MergeVcfs does not currently provide. This information can be used to improve the next version, the GATK3 version was sometimes inconsistent in performance and that is why it was not ported over in its entirety.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited January 12

    Those 2 tools don't even work the same way.

    When 2 vcfs from different callers were merged using mergevcfs same positions are not handled as one but CombineVariants can handle this scenario using uniquify option

    An alternative could be using bcftools however it messes up with FORMAT fields in the final output when 2 callers have different FORMAT options.

    I guess everybody needs this feature to be ported to GATK4 and picard. Maintaining an older GATK version in workflows is really getting confusing sometimes.

    Here is an example. One vcf is from HaplotypeCaller and the other is from UnifiedGenotyper from the same sample.

    MergeVcfs output for the same position

    19  49621423    .   C   T   805.77  PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=2.417;DP=25;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.000;QD=32.23;ReadPosRankSum=-1.103;SOR=0.776;set=variant GT:AD:DP:GQ:PL  1/1:2,23:25:63:834,63,0
    19  49621423    .   C   T   833.77  PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=1.769;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=34.74;ReadPosRankSum=-0.795;SOR=0.353;set=variant   GT:AD:DP:GQ:PL  1/1:1,23:24:62:862,62,0
    

    And here is the output from CombineVariants with uniquify option for the same position

    19  49621423    .   C   T   833.77  PASS    AC=4;AF=1.00;AN=4;DP=49;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;set=Intersection  GT:AD:DP:GQ:PL  1/1:1,23:24:62:862,62,0 1/1:2,23:25:63:834,63,0
    

    Finally here is the output from bcftools merge

    19  49621423    .   C   T   833.77  PASS    BaseQRankSum=1.769;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=34.74;ReadPosRankSum=-0.795;SOR=0.353;set=variant;Dels=0;HaplotypeScore=0.8941;MQ0=0;DP=49;AF=1;MLEAC=2;MLEAF=1;AN=4;AC=4 GT:AD:DP:GQ:PL  1/1:1,23:24:62:862,62,0 1/1:2,23:25:63:834,63,0
    

    CombineVariants is also not perfect and looks like missing INFO fields that conflict for values however since this step is usually done after individual VCFs are filtered it is not much of a problem in the final output.

    This may be viewed as a non-canonical and even non-standard way of handling vcf files for HTS however this particular practice is quite common when you are using multiple callers and try not to loose potentially important variants due to algorithmic differences.

    Post edited by SkyWarrior on
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @SkyWarrior

    Thank you for the input and we are working on finding a solution for this. Your input is very helpful.

  • xiuczxiucz Member

    @SkyWarrior Thanks for your comment, and it is really helpful. Let us wait for the new features of this tool.
    @bhanuGandham And thank you and GATK team. I think it is more reasonable to combine different vcfs from different callers at vcf-level.

  • xiuczxiucz Member

    @SkyWarrior, here is an example command, any guidance is greatly appreciated.

    java -jar ~/GenomeAnalysisTK.jar -T CombineVariants \
    -R ~/database/hg19/gatk_bundle/ucsc.hg19.fasta \
    -genotypeMergeOptions PRIORITIZE \
    -priority Mutect2,varscan,strelka \
    -V:varscan ~/combineanalysis/varscan.vcf \
    -V:strelka ~/combineanalysis/strelka.vcf \
    -V:Mutect2 ~/combineanalysis/mutect2_twicefiltered.vcf \
    -o ~/merged.vcf.gz
    
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    @xiucz in your case you may need to use UNIQUIFY option to see all callers' results in a single line for one site.

  • xiuczxiucz Member

    Hi,
    @SkyWarrior thanks for your promotion advice. Strangely, I didn't get any information in the second FORMAT column, the sample.freebayes column only contains "./." .

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample.VQSR     sample.freebayes
    chr1    13116   .   T   G   97.46   VQSRTrancheSNP99.00to99.90  AC=1;AF=0.500;AN=2;BaseQRankSum=2.64;ClippingRankSum=0.00;DP=10;ExcessHet=4.7712;FS=4.715;MQ=35.83;MQRankSum=-2.402e+00;QD=3.05;ReadPosRankSum=-2.640e-01;SOR=0.720;VQSLOD=-3.171e+00;culprit=MQ;set=FilteredInAll  GT:AD:DP:GQ:PGT:PID:PL  0/1:7,3:10:99:0|1:13116_T_G:105,0,393  ./.
    
    comand1
    java -jar ~/GenomeAnalysisTK.jar -T CombineVariants \
    -R ~/database/hg19/gatk_bundle/ucsc.hg19.fasta \
    -genotypeMergeOptions UNIQUIFY \
    -V:freebayes freebayes.filter.results.vcf \
    -V:VQSR VQSR.final.vcf \
    -o tmp.vcf
    

    After regexing the INFO column, I got these set tags:

    set=FilteredInAll
    set=filterInVQSR-freebayes
    set=freebayes
    set=Intersection
    set=VQSR
    

    set=FilteredInAll means union, set=freebayes and set=VQSR means the variant is only in their own dataset. What does the "set=filterInVQSR-freebayes" tag mean?

    #freebayes result
    chr1    14930   .   A   G   287.276 .   AB=0.26087;ABP=48.7056;AC=1;AF=0.5;AN=2;AO=24;CIGAR=1X;DP=92;DPB=92;DPRA=0;EPP=6.26751;EPPR=4.1599;GTI=0;LEN=1;MEANALT=1;MQM=46;MQMR=37.1029;NS=1;NUMALT=1;ODDS=66.1478;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=816;QR=2277;RO=68;RPL=13;RPP=3.37221;RPPR=15.7837;RPR=11;RUN=1;SAF=8;SAP=8.80089;SAR=16;SRF=25;SRP=13.3567;SRR=43;TYPE=snp;technology.ILLUMINA=1   GT:DP:AD:RO:QR:AO:QA:GL 0/1:92:68,24:68:2277:24:816:-39.8086,0,-136.39
    
    #vqsr result
    chr1    14930   .   A   G   1793.44 VQSRTrancheSNP99.00to99.90  AC=1;AF=0.500;AN=2;BaseQRankSum=1.44;ClippingRankSum=0.00;DP=77;ExcessHet=4.7712;FS=0.000;MQ=48.52;MQRankSum=1.78;NEGATIVE_TRAIN_SITE;QD=7.15;ReadPosRankSum=1.05;SOR=0.667;VQSLOD=-2.914e+00;culprit=MQ    GT:AD:DP:GQ:PL  0/1:55,22:77:99:462,0,1578
    

    Thank you.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Your combined result is from a different position. Can you check the proper position that you posted in the bottom?

  • xiuczxiucz Member

    @SkyWarrior
    Sorry, I posted the combined result wrongly , it looks fine now. Thank you.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @SkyWarrior Thank you for helping out with this issue. We appreciate your help.

  • alanhoylealanhoyle UNC Lineberger Member

    I was running into this, largely because the Mutect2 documentation that shows up first on Google is the documentation for the GATK 3.8 version. for the current 4.x version, we instead need to be running gatk CreateSomaticPanelOfNormals which appears to replace the use of CombineVariants for that purpose.

    Mentioning that here might help others in the future.

  • olavurolavur Member

    Any news on getting CombineVariants working in GATK4?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited March 11

    @olavur

    We will not be porting CombineVariants to GATK4(for now).

  • @bhanuGandham ,
    you said in October 2018:
    "The developers got back to me and mentioned that porting CombineVariants to GTAK4 is a work in progress. I have let them know that this is a feature that the users wants and hence will be prioritized.
    For now you can use it from GATK3.
    I hope this helps.".
    What happen? Is the CombineVariants that tough nut to crack? :)

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Vladimir_Kovacevic

    We have hit some roadblocks with that. We will let GATK users know when we know more. Currently the team is focused on other goals.

Sign In or Register to comment.