GATK v3.3-0 CombineVariants ignoring assumeIdenticalSamples flag?

Hey GATK Team!

I'm attempting to merge identical sample sets called on different (disjoint) chromosomes using the CombineVariants tool with the --assumeIdenticalSamples flag enabled, but the v3.3-0 tool behaves as if this option were not enabled and asks me to specify a --genotypemergeoption. When I execute the exact same command using v3.2-2, the tool runs to completion without error.

Enabling --genotypemergeoption UNIQUIFY with v3.3-0 just to see what happens (while --assumeIdenticalSamples is still enabled) outputs each sample 18 times (once for each chromosome) without any loss in the number of loci (the sum of loci in the individual files equals the number of loci in the combined output file).
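For anyone who wants to repeat the loci-count sanity check, here is a minimal Python sketch. It is not part of GATK; the function names are made up for illustration, and it assumes uncompressed VCF files in which every non-header line is one record.

```python
def count_records(vcf_path):
    """Count data lines in an uncompressed VCF, skipping '#' header lines."""
    with open(vcf_path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )


def loci_preserved(input_paths, combined_path):
    """True if the combined file holds as many records as all inputs together."""
    total_in = sum(count_records(path) for path in input_paths)
    return total_in == count_records(combined_path)
```

If the inputs are truly disjoint, `loci_preserved` should return True for a correct merge; a False result would point to dropped or merged records.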

I believe this may be a bug, but I could be wrong...

INFO  13:10:49,188 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,190 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  13:10:49,191 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  13:10:49,191 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  13:10:49,194 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples 
INFO  13:10:49,198 HelpFormatter - Executing as [email protected] on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO  13:10:49,198 HelpFormatter - Date/Time: 2014/12/28 13:10:49 
INFO  13:10:49,199 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,199 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:10:49,731 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:10:50,015 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  13:10:50,731 GenomeAnalysisEngine - Preparing for traversal 
INFO  13:10:50,751 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:10:50,752 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:10:50,752 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:10:50,752 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  13:10:51,578 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Duplicate sample names were discovered but no genotypemergeoption was supplied. To combine samples without merging specify --genotypemergeoption UNIQUIFY. Merging duplicate samples without specified priority is unsupported, but can be achieved by specifying --genotypemergeoption UNSORTED.
##### ERROR ------------------------------------------------------------------------------------------


INFO  13:02:06,270 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,272 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 
INFO  13:02:06,272 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  13:02:06,272 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  13:02:06,275 HelpFormatter - Program Args: -T CombineVariants -R genome.fasta -V ./homVar.list --out homVar.SNP.vcf --suppressCommandLineHeader --assumeIdenticalSamples 
INFO  13:02:06,279 HelpFormatter - Executing as [email protected] on Linux 2.6.32-431.20.3.el6.nersc.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO  13:02:06,280 HelpFormatter - Date/Time: 2014/12/28 13:02:06 
INFO  13:02:06,280 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,280 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:02:06,812 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:02:07,098 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  13:02:07,818 GenomeAnalysisEngine - Preparing for traversal 
INFO  13:02:07,837 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:02:07,838 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:02:07,838 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:02:07,838 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  13:02:07,840 CombineVariants - Priority string is not provided, using arbitrary genotyping order: null 
INFO  13:02:10,798 ProgressMeter -            done      2967.0     2.0 s      16.6 m       88.9%     2.0 s       0.0 s 
INFO  13:02:10,799 ProgressMeter - Total runtime 2.96 secs, 0.05 min, 0.00 hours 
INFO  13:02:11,589 GATKRunReport - Uploaded run statistics report to AWS S3

Best Answer

  • Geraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    CombineVariants now requires a --genotypemergeoption because of a bug introduced in the current release as a side effect of another bug fix (dominoes, dominoes everywhere). The dev who was responsible has been duly flogged and has pushed a fix, which I believe is available in the nightly builds. Sorry about that!

Answers

  • bredeson Member ✭✭
    edited December 2014

    FYI, I verified that the call sets are indeed disjoint by chromosome.
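A check along these lines can be sketched in a few lines of Python. This is only an illustration, not the method actually used above: the function names are hypothetical, and it assumes uncompressed VCFs with the chromosome name in the first tab-separated column.

```python
def chromosomes_in(vcf_path):
    """Return the set of CHROM values (first column) found in a VCF."""
    chroms = set()
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chroms.add(line.split("\t", 1)[0])
    return chroms


def shared_chromosomes(vcf_paths):
    """Map each chromosome seen in more than one file to those files."""
    seen = {}
    for path in vcf_paths:
        for chrom in chromosomes_in(path):
            seen.setdefault(chrom, []).append(path)
    return {chrom: paths for chrom, paths in seen.items() if len(paths) > 1}
```

An empty dictionary from `shared_chromosomes` means no chromosome appears in more than one input, i.e. the call sets are disjoint by chromosome.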

  • pdexheimer Member ✭✭✭✭

    Well, I think the workaround is to use --genotypemergeoption UNSORTED, just like the error message says. I think the 'bug' is that UNSORTED should be supported if you explicitly provide --assumeIdenticalSamples.

    Alternatively, maybe --assumeIdenticalSamples should be removed as unsupported. The current best solution for this particular use case is CatVariants.

  • bredeson Member ✭✭

    Hey @pdexheimer,

    Thank you for your response.

    It seems to me that my result from using --genotypemergeoption UNIQUIFY demonstrates that v3.3-0 CombineVariants is not properly using the chromosome names and positions to distinguish the loci as non-overlapping calls. If it were, I'd expect each sample to be represented N times (once for each chromosome, as I'm observing), with N-1 of those columns/genotypes containing no-calls ('./.') at each locus, because my data are disjoint. That is not what I'm observing, however: all N genotypes have calls.
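To make that expectation concrete, here is a rough Python sketch that counts called genotypes per record in a combined VCF. The function names are hypothetical. It assumes the standard VCF layout (nine fixed columns followed by one column per sample) and that GT is the first FORMAT subfield, which the VCF specification requires whenever GT is present.

```python
def count_called(sample_fields):
    """Count sample columns whose GT has at least one non-missing allele."""
    called = 0
    for sample in sample_fields:
        genotype = sample.split(":", 1)[0]
        alleles = genotype.replace("|", "/").split("/")
        if any(allele != "." for allele in alleles):
            called += 1
    return called


def unexpected_call_counts(record_lines, expected_called):
    """Return (chrom, pos, called) for records whose called count differs."""
    problems = []
    for line in record_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        called = count_called(fields[9:])
        if called != expected_called:
            problems.append((fields[0], int(fields[1]), called))
    return problems
```

With S original samples and disjoint inputs, one would expect exactly S called genotypes per record in the uniquified output, so `unexpected_call_counts(lines, S)` should come back empty if the tool behaves as described.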

    Taking your suggestion, I tried --genotypemergeoption UNSORTED --assumeIdenticalSamples and it seems to have given me the same result as v3.2-2, but why is the --genotypemergeoption required now?
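For reference, the workaround invocation can be assembled as below. This is only a sketch: the flags are the ones from the Program Args line in the log above plus the UNSORTED option, while the jar name `GenomeAnalysisTK.jar` and the function name are assumptions made for illustration.

```python
def combine_variants_args(reference, variants, out, jar="GenomeAnalysisTK.jar"):
    """Build the argument list for the v3.3-0 workaround command.

    The jar path is an assumption; adjust it to your installation.
    """
    return [
        "java", "-jar", jar,
        "-T", "CombineVariants",
        "-R", reference,
        "-V", variants,
        "--out", out,
        "--suppressCommandLineHeader",
        "--assumeIdenticalSamples",
        # Extra option required by v3.3-0; v3.2-2 ran without it.
        "--genotypemergeoption", "UNSORTED",
    ]
```

The resulting list can be passed to `subprocess.run(args, check=True)`, for example `combine_variants_args("genome.fasta", "./homVar.list", "homVar.SNP.vcf")`.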

    These options do appear to have redundant purposes, as you say. Deprecating --assumeIdenticalSamples in favor of --genotypemergeoption UNSORTED would be confusing, though. To the general user, UNSORTED could be taken to mean either that the genotypes from multiple call sets for a given locus and a given sample will be output in an unsorted manner (one column per sample in the output), or that the loci will be merged with N output columns per sample, with the genotypes ordered per locus by whatever order the input VCFs were listed in.

  • pdexheimer Member ✭✭✭✭
    edited December 2014

    Hi @bredeson -

    Are you saying that UNIQUIFY is treating each chromosome/sample pair as a distinct "sample", but giving each one all of the calls for the underlying sample? The second part of that would definitely be a bug. Just to clarify, can you post a toy example with one or two samples and three or so chromosomes?

    The assumeIdenticalSamples/UNSORTED behavior is not correct right now. Adding UNSORTED to your command line is simply a workaround; it shouldn't be required. In my view, there are two approaches that could be taken, and I'm not sure which is better:

    1. Remove the requirement to explicitly specify UNSORTED with assumeIdenticalSamples. This would elevate your use case from unsupported (as use of UNSORTED with duplicate sample names typically is) to supported, and revert the behavior you see to the pre-3.3 way.
    2. Remove the assumeIdenticalSamples argument from CombineVariants altogether, and clarify that the only supported way to solve your use case is to use CatVariants.

    Either solution will work; it's mostly a matter of determining which direction the Broadies want to support.
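To illustrate why plain concatenation fits this use case, here is a minimal Python sketch of the idea. It is not CatVariants itself and does no validation; the function name is made up. It assumes uncompressed inputs that all share the same header (same samples in the same column order) and are listed in reference order.

```python
def concatenate_vcfs(input_paths, output_path):
    """Write the first file's header, then every file's records in order."""
    with open(output_path, "w") as out:
        for index, path in enumerate(input_paths):
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # drop blank lines
                    if line.startswith("#") and index > 0:
                        continue  # keep only the first file's header
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    out.write(line)
```

Because the per-chromosome files cover different loci for the same samples, no genotype merging is needed at all; the records only have to be stacked under a single header.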
