CombineVariants in GATK4

Are there plans to add the CombineVariants tool to the GATK4.0 toolkit (it existed in previous GATK versions)? The only similar tool currently available in the GATK4.0 beta is GatherVcfs, which has very limited functionality and cannot concatenate unsorted VCFs or merge different INFO fields correctly.
Thanks! :)


  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @Vladimir_Kovacevic
    Hi.

    There is a tool called MergeVcfs which you can use instead of CombineVariants. It looks like there is no documentation for it yet, but if you use --list with gatk-launch, it lists the available tools. We should have better documentation within the coming months when GATK4 is out of beta.

    -Sheila

  • Hi @Sheila!
    Thank you for your response and suggestion. We tried MergeVcfs and unfortunately it failed with two VCFs that pass with CombineVariants. Here is the error:
    Using GATK jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf
    12:00:35.965 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
    [September 19, 2017 12:00:35 PM CEST] MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 1 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX true --CREATE_MD5_FILE false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
    [September 19, 2017 12:00:35 PM CEST] Executing as [email protected] on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.2
    12:00:41.015 INFO MergeVcfs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:00:41.016 INFO MergeVcfs - Deflater: IntelDeflater
    12:00:41.016 INFO MergeVcfs - Inflater: IntelInflater
    12:00:41.016 INFO MergeVcfs - Initializing engine
    12:00:41.016 INFO MergeVcfs - Done initializing engine
    12:00:41.573 INFO MergeVcfs - Processed 10,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 3:189,995,416
    12:00:41.785 INFO MergeVcfs - Processed 20,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 8:132,077,728
    12:00:41.957 INFO MergeVcfs - Shutting down engine
    [September 19, 2017 12:00:41 PM CEST] org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs done. Elapsed time: 0.10 minutes.
    Runtime.totalMemory()=352845824
    java.lang.IllegalStateException: The elements of the input Iterators are not sorted according to the comparator htsjdk.variant.variantcontext.VariantContextComparator
    at htsjdk.samtools.util.MergingIterator.next(MergingIterator.java:113)
    at org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs.doWork(MergeVcfs.java:126)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:62)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:230)

    Do you have any more suggestions?

    FYI @teodora_aleksic

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @Vladimir_Kovacevic
    Hi,

    Hmm. Can you confirm the VCFs pass ValidateVariants?

    Also, can you try deleting the VCF indices and re-generating them?

    Thanks,
    Sheila
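
A minimal sketch of the two checks suggested above, assuming a GATK4 release where the launcher is called `gatk` (the 4.beta builds in this thread used `gatk-launch`). The helper name `validate_and_reindex` and the file names are hypothetical. `IndexFeatureFile` takes `-I` in recent releases; early 4.0.x releases took `-F`, so check `--help` for your version.

```shell
# Hypothetical helper: validate a VCF, then drop and rebuild its index.
# Usage: validate_and_reindex input.vcf reference.fasta
validate_and_reindex() {
    vcf=$1
    ref=$2
    # Check the file for format problems and against the reference
    gatk ValidateVariants -R "$ref" -V "$vcf" || return 1
    # Remove a possibly stale index (.idx), then regenerate it
    rm -f "${vcf}.idx"
    gatk IndexFeatureFile -I "$vcf"
}
```

Run it once per input VCF before retrying the merge.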

  • said3427, Mexico (Member)
    edited September 2017

    I am moving to GATK4 and had the same question. It worked for me :smile:

    Thank you,
    Said MM

  • mjtiv, Newark, DE (Member)
    edited April 4

    It appears that MergeVcfs is built on top of Picard (the GATK 4.0 documentation mentions this too). So, if you run into issues with this command, go to the Picard documentation and look at what it recommends (the files must be sorted the same way, and the output file must have a designated file type such as .vcf). Just skimming the error message above, I think the error is caused by the files not being sorted the same way.

    Here is a similar command using straight Picard:

    java -jar picard.jar MergeVcfs \
    I=sample_8_filtered_raw_SNPs.vcf \
    I=filtered_sample_8_raw_indels.vcf \
    O=combined_Filtered_Variants_4-4-2018.vcf
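
If the inputs are not sorted the same way, one option is to sort each file before merging. A minimal sketch, assuming a GATK4 release with the `gatk` launcher and the bundled Picard tools `SortVcf` and `MergeVcfs`. The helper name and file names are hypothetical, both inputs are assumed to declare the same contigs in their headers, and paths containing spaces are not handled.

```shell
# Hypothetical helper: coordinate-sort every input, then merge the sorted copies.
# Usage: sort_then_merge merged.vcf first.vcf second.vcf ...
sort_then_merge() {
    out=$1
    shift
    merge_args=""
    for vcf in "$@"; do
        sorted="${vcf%.vcf}.sorted.vcf"
        # SortVcf orders records by the sequence dictionary in the VCF header
        gatk SortVcf -I "$vcf" -O "$sorted" || return 1
        merge_args="$merge_args -I $sorted"
    done
    # Left unquoted on purpose so each "-I file" pair becomes separate arguments
    gatk MergeVcfs $merge_args -O "$out"
}
```

If the headers list contigs in different orders, sorting alone may not be enough and the headers would need to be reconciled first.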

  • hdetering, Vigo, Spain (Member)

    While Picard's MergeVcfs can be used to combine variants from VCFs containing the same samples, this covers only one mode (UNION) of two (MERGE, UNION) that CombineVariants offered.

    The MERGE mode is more akin to bcftools merge, combining info about variant loci from multiple samples, each contained in their individual VCFs.

    Is there a tool in GATK4 that covers the MERGE mode? (I would like to use it to combine results from MuTect for multiple samples.)

    Cheers,

    Harry
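
For reference, the `bcftools merge` behaviour mentioned above looks roughly like this. A sketch only, assuming bcftools and tabix (htslib) are installed and each single-sample VCF is already bgzip-compressed and indexed; the helper name and file names are hypothetical.

```shell
# Hypothetical helper: combine single-sample VCFs into one multi-sample VCF.
# Usage: merge_samples merged.vcf.gz sample1.vcf.gz sample2.vcf.gz ...
merge_samples() {
    out=$1
    shift
    # -O z writes bgzip-compressed VCF; each input contributes its own sample column(s)
    bcftools merge -O z -o "$out" "$@" || return 1
    # Index the merged file so downstream tools can read it by region
    tabix -p vcf "$out"
}
```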
  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @hdetering
    Hi Harry,

    You should be able to use CombineVariants to merge VCFs from different samples. Is there a reason you are against using it?

    -Sheila

  • davidben, Boston (Member, Broadie, Dev ✭✭)

    @hdetering wrote: "Is there a tool in GATK4 that covers the MERGE mode? (I would like to use it to combine results from MuTect for multiple samples.)"

    @hdetering Apologies for being nosy but implementing a multi-sample mode or workflow is one of our top priorities for Mutect2 in the coming months. What features would this have to have to suit your needs?

  • FPBarthel, Houston (Member)
    edited August 16

    Is there a specific reason why CombineVariants is not in GATK 4? I am also interested in a CombineVariants port to GATK 4 (e.g. for combining variants of the same sample from different callers for side-by-side comparison), in addition to a multi-sample workflow for Mutect2 (for genotyping the same variant across multiple samples).

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @FPBarthel
    Hi,

    There is MergeVcfs for your first case. For the second (combining VCFs from different samples), I think the team discourages this because we have the GVCF workflow and don't recommend merging single-sample VCFs.

    -Sheila

  • FPBarthel, Houston (Member)

    Hi @Sheila,

    Admittedly I have not tested this, but from the documentation it looks like MergeVcfs does not have a feature that allows for variant intersection? E.g. supply MergeVcfs with three input VCF files and have it output only variants present in at least 2/3 input files? Something like this? That would be what I am looking for in the first case.

    The second case would be to take the union of multiple VCF files from different samples, discarding the FORMAT columns, followed by a second step of genotyping the variant set across multiple samples, thus reintroducing the FORMAT column for multiple samples.

    Floris

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @FPBarthel
    Hi Floris,

    @FPBarthel wrote: "E.g. supply MergeVcfs with three input VCF files and have it output only variants present in at least 2/3 input files?"

    In this case, you would have to first merge the VCFs, then use SelectVariants to select the sites of interest.

    -Sheila
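
One way to get the sites present in at least 2 of 3 files is `bcftools isec`, which is a swapped-in alternative and not the MergeVcfs plus SelectVariants route described above. A sketch, assuming bcftools is installed and the inputs are bgzip-compressed and indexed; the helper name, directory name and file names are hypothetical.

```shell
# Hypothetical helper: keep sites present in at least MIN_N of the input files.
# Usage: sites_in_at_least 2 isec_out a.vcf.gz b.vcf.gz c.vcf.gz
sites_in_at_least() {
    min_n=$1
    outdir=$2
    shift 2
    # -n+N selects positions seen in N or more files; -p writes the results into outdir
    bcftools isec "-n+${min_n}" -p "$outdir" "$@"
}
```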

  • Hi @Sheila,

    I am a bit confused. After following the GATK Best Practices up to this step, https://software.broadinstitute.org/gatk/documentation/article.php?id=2806, and obtaining the INDEL.vcf and SNP.vcf for a sample, how should we continue: analyze each separately, or could we combine them using MergeVcfs?

    My next steps are annotation, CalculateGenotypePosteriors and VariantsToTable.
    PS: (I am using WES).

    Best,
    Marius.

  • davidben, Boston (Member, Broadie, Dev ✭✭)

    If you want to proceed with CalculateGenotypePosteriors you would want to combine the SNP and indel vcfs with MergeVcfs. If you're using VariantsToTable it depends on whether having separate tables for SNPs and indels is more convenient.
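
A minimal sketch of that merge, assuming a GATK4 release with the `gatk` launcher; the helper name and file names are hypothetical, and both inputs are assumed to come from the same sample(s) with the same contig order.

```shell
# Hypothetical helper: recombine the separately filtered SNP and indel VCFs.
# Usage: merge_snps_indels filtered_snps.vcf filtered_indels.vcf merged.vcf
merge_snps_indels() {
    snps=$1
    indels=$2
    out=$3
    # MergeVcfs interleaves the records of both files in coordinate order
    gatk MergeVcfs -I "$snps" -I "$indels" -O "$out"
}
```

The merged file can then be passed to CalculateGenotypePosteriors as a single input.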

  • Thanks, @davidben it worked with MergeVcfs perfectly :smiley:

  • kmmahan (Member)

    Why do we need to merge vcfs? Do we have to or can we keep them separate?

  • davidben, Boston (Member, Broadie, Dev ✭✭)

    @kmmahan At this point we're talking about analyses downstream of the GATK. Sometimes it will be convenient to keep SNPs and indels together; sometimes it won't. I will say, however, that if the variants do somehow end up as the input to a GATK tool, it would most likely expect a single merged vcf. That's just the usual GATK style.

  • nketchi (Member)

    Hi everyone,

    I am faced with a similar problem. Having gotten to the point where I am happy with my vcf file filtering, I would like to use CombineVariants to select only SNPs which occur across all of my populations. Before with gatk3, we did it like this:

    gatk --java-options "-Xmx10g" -l INFO -T CombineVariants -R $reference \
    --variant $maxmiss_filtered.pop1.recode.vcf \
    --variant $maxmiss_filtered.pop2.recode.vcf \
    --variant $maxmiss_filtered.pop3.recode.vcf \
    --variant $maxmiss_filtered.pop4.recode.vcf \
    --variant $maxmiss_filtered.pop5.recode.vcf \
    --variant $maxmiss_filtered.pop6.recode.vcf \
    --minimumN 6 -o $maxmiss_filtered.merged.vcf

    Can someone please tell me which steps I would have to take using MergeVcfs and SelectVariants to achieve a similar result (using GATK4)?

    I need a merged vcf across my populations for all downstream analyses!

    I am still wondering why this tool was removed? (it was so useful!)

    Thanks,

    Josie

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)

    Hi @nketchi

    1) We are discussing with the developers why CombineVariants has not been ported to GATK4 and I will get back to you with an answer.

    2) If it helps your case, you can continue using CombineVariants from GATK3 in the meantime.

    Regards
    Bhanu

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)
    Accepted Answer

    Hi @nketchi

    The developers got back to me and mentioned that porting CombineVariants to GATK4 is a work in progress. I have let them know that this is a feature that users want, and hence it will be prioritized.
    For now you can use it from GATK3.
    I hope this helps.

    Regards
    Bhanu

  • Vladimir_Kovacevic (Member)
    Accepted Answer

    Thank you very much, @bhanuGandham !

  • Rosmaninho (Member)

    Hopefully this is done fast... I would like to use CombineVariants in GATK4 as well.

  • aschoenr (Member)

    Just thought I'd add my vote to bring CombineVariants into GATK4. Will be using GATK3 for this alone until then.

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)

    Hi @aschoenr @Vladimir_Kovacevic @nketchi @hdetering @Rosmaninho @FPBarthel

    Thank you all for bringing this to our notice. I am trying to see if our team can either port CombineVariants or have those features included in MergeVcfs in GATK4.

    It would be very helpful if you could tell me which specific CombineVariants features you rely on that are not possible with MergeVcfs. Thank you in advance.

    Regards
    Bhanu

  • Union of variants from two or more somatic callers (tumor-normal VCFs), where the tumor-normal pair is from the same person. If I were you, I would test a set of the most commonly used callers, for example Strelka, VarDict, Mutect2, VarScan, SomaticSniper, Rufus etc.

  • Rosmaninho (Member)

    I did not even use MergeVcfs because I didn't want to merge VCFs.
    I used CombineVariants because of the -minN option, which outputs only sites observed in at least a given minimum number of the input samples.

    For example, I have a set of 10 samples with a phenotype in common and I want a vcf file with the sites that are present in a minimum of 5 of those 10 samples.
    This way I got a vcf with variants that are present in a significant number of my samples and might be significant but I am not being extremely restrictive and asking for variants present in all my samples.
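
To illustrate the counting idea behind -minN outside GATK, here is a rough sketch in plain shell. It is not a GATK tool and not how CombineVariants is implemented: the helper name is hypothetical, inputs are assumed to be uncompressed VCFs with one file per sample, a site is matched on exact CHROM, POS, REF and ALT, and the output is a tab-separated site list (sorted as text), not a VCF.

```shell
# Hypothetical helper: list sites seen in at least MIN_N of the given VCFs.
# Usage: sites_in_min_n 5 sample1.vcf sample2.vcf ... sample10.vcf
sites_in_min_n() {
    min_n=$1
    shift
    for vcf in "$@"; do
        # One line per distinct CHROM/POS/REF/ALT in this file (header lines skipped)
        grep -v '^#' "$vcf" | cut -f1,2,4,5 | sort -u
    done |
        # Count in how many files each site occurred, then keep those reaching MIN_N
        sort | uniq -c |
        awk -v n="$min_n" '$1 >= n { print $2 "\t" $3 "\t" $4 "\t" $5 }'
}
```

The resulting list could then be used as a site filter for whichever tool produces the final VCF.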

  • @Rosmaninho, I did not even know about this feature of CombineVariants. Well, what can I say except, it is one hell of a tool!

  • Rosmaninho (Member)

    Well, I was looking specifically for a tool to do this, so I found it in CombineVariants...
    If you had been looking for this, I'm sure you would have found out about it as well. :P

  • jjmmii (Member)

    I need CombineVariants to merge two VCFs with priority. Seems this can't be done in MergeVcfs, so I'm adding my vote for GATK4 CombineVariants as well. Thanks so much to the developers.

  • aschoenr (Member)

    Hi @bhanuGandham, sorry, I didn't see this until now. I needed to use CombineVariants in my implementation of the Germline short variant discovery workflow. Specifically, I implemented the joint genotyping in parallel, processing each chromosome individually. Since I needed a single VCF for VQSR, I needed to combine all of the VCFs I produced. I tried another method and I ended up with far too many entries (columns) in the resulting VCF. Since the samples were the same (and in the same order) in all of my VCFs, I think I just needed to basically concatenate the files together while maintaining the header, and CombineVariants was the tool I found mentioned on the GATK forums to do this. Please let me know if I can achieve this with another tool in GATK4. Like I said, this is probably the simplest case in which VCFs need to be combined.
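
For the per-chromosome case described above (same samples, same order, non-overlapping chunks), one candidate is the Picard-derived `GatherVcfs` bundled with GATK4. A sketch, assuming the `gatk` launcher, inputs listed in genomic order, and no spaces in the paths; the helper name and file names are hypothetical.

```shell
# Hypothetical helper: concatenate per-chromosome VCFs that share one header.
# Usage: gather_chromosomes all.vcf chr1.vcf chr2.vcf chr3.vcf ...
gather_chromosomes() {
    out=$1
    shift
    gather_args=""
    for vcf in "$@"; do
        gather_args="$gather_args -I $vcf"
    done
    # Inputs must be passed in genomic order; left unquoted so each "-I file" splits
    gatk GatherVcfs $gather_args -O "$out"
}
```

If the chunks cannot be guaranteed to be in order, MergeVcfs with the same `-I` arguments is an alternative to try.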

  • bhanuGandham (Member, Administrator, Broadie, Moderator, admin)

    Thanks for all the feedback!
