
How to view only the variants that are present in multiple MuTect VCF files?

I have several vcf files generated by MuTect comparing tumor to matched normal samples. Is there a way to generate a list or vcf file of the variants that are present in all or many of the samples? Something along the lines of: "these variants are in all of the samples," or "these variants are shared in 2 samples" and so on.

I tried vcf-isec on the MuTect vcf files, but I received a warning about the column names not matching (i.e. 1-Normal and 1-Tumor) and the output file was 28 bytes of unreadable characters.

Any help is appreciated, thank you.


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Have a look at the GATK tools for combining and selecting variants, starting with CombineVariants:

  • ronton, USA (Member)

    Thank you for the advice Geraldine.

    So, I could run CombineVariants on all of my 10 MuTect vcf files to generate one output vcf file, and then run SelectVariants with -select 'set == "Intersection"' to output a single vcf file that includes variants shared by all input samples?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @ronton,

    Yep, that's the idea. Make sure to tag each one with a name when you pass them in (like `-V:name1 sample1.vcf`).
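
    With ten input files, typing every tagged argument by hand gets error-prone. A minimal sketch of building the `-V:name file` arguments in a loop follows; the helper name `tagged_variant_args` is hypothetical, and it assumes each tag is simply the file name up to its first dot.

```shell
# Sketch: build " -V:<tag> <file>" arguments for CombineVariants, one per VCF.
# Assumption: each tag is the file name up to its first dot
# (e.g. 1.mutect.sorted.vcf -> tag "1"); any unique names would do.
tagged_variant_args() {
  for vcf in "$@"; do
    base=${vcf##*/}                         # strip the directory part
    printf ' -V:%s %s' "${base%%.*}" "$vcf" # tag = name before the first dot
  done
}

# Hypothetical usage (paths are placeholders; the unquoted $(...) relies on
# word splitting, so this form assumes paths without spaces):
#   java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
#     $(tagged_variant_args /path/to/*.vcf) -o union.vcf
```

    Deriving the tags from the file names keeps them unique as long as the file names themselves differ before the first dot.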

  • ronton, USA (Member)

    Thanks again, I will try this out ASAP.

  • ronton, USA (Member)

    I received some errors when attempting to run CombineVariants. I apologize, I am not a super adept programmer.

    The .vcf files I have were generated by MuTect, using tumor.bam and normal.bam. So my 10 .vcf files are 10 different tumor normal pairs.

    The first error I received had to do with the .vcf files not being sorted properly (contigs not in the same order as the reference). I was able to sort the .vcf files using a script called I used the ucsc.hg19.dict file, the same reference file that I downloaded from Broad FTP and use in my /gatk_3.3/resources directory, to sort.

    Now, with the .mutect.sorted.vcf files, I ran:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R /home/me/gatk_3.3/resources/ucsc.hg19.fasta -V:1 /home/me/test/1.mutect.sorted.vcf -V:2 /home/me/test/2.mutect.sorted.vcf -V:3 /home/me/test/3.mutect.sorted.vcf -V:4 /home/me/test/4.mutect.sorted.vcf -V:5 /home/me/test/5.mutect.sorted.vcf -V:6 /home/me/test/6.mutect.sorted.vcf -V:7 /home/me/test/7.mutect.sorted.vcf -V:8 /home/me/test/8.mutect.sorted.vcf -V:9 /home/me/test/9.mutect.sorted.vcf -V:10 /home/me/test/10.mutect.sorted.vcf -o /home/me/test/union.vcf

    From this, I received the following error:

    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR MESSAGE: Duplicate sample names were discovered but no genotypemergeoption was supplied. To combine samples without merging specify --genotypemergeoption UNIQUIFY. Merging duplicate samples without specified priority is unsupported, but can be achieved by specifying --genotypemergeoption UNSORTED.

    Next, I included -genotypeMergeOptions UNIQUIFY and ran:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R /home/me/gatk_3.3/resources/ucsc.hg19.fasta -V:1 /home/me/test/1.mutect.sorted.vcf -V:2 /home/me/test/2.mutect.sorted.vcf -V:3 /home/me/test/3.mutect.sorted.vcf -V:4 /home/me/test/4.mutect.sorted.vcf -V:5 /home/me/test/5.mutect.sorted.vcf -V:6 /home/me/test/6.mutect.sorted.vcf -V:7 /home/me/test/7.mutect.sorted.vcf -V:8 /home/me/test/8.mutect.sorted.vcf -V:9 /home/me/test/9.mutect.sorted.vcf -V:10 /home/me/test/10.mutect.sorted.vcf -genotypeMergeOptions UNIQUIFY -o /home/me/test/union.vcf

    From this, I received the error:

    ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR MESSAGE: Key VT found in VariantContext field INFO at chr1:2418330 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

    Does something need to be defined in the VCF headers?

    To clarify, what I am hoping to do is look at the 'overlapping' or common variants in two or more MuTect .vcf files. Any guidance or help with that is very appreciated, thank you.
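
    For a quick look at the overlap without going through GATK, one rough alternative is to count, for each variant, how many of the VCF files contain it. The sketch below is plain awk, not a GATK feature. The function name `shared_variant_counts` is hypothetical, and it assumes variants can be matched on exact CHROM, POS, REF and ALT strings, with no filtering on the FILTER column (so rejected calls, if present in the files, are counted too).

```shell
# Sketch: for each variant (CHROM, POS, REF, ALT), count how many of the
# given VCF files contain it, then print the count followed by the variant.
# Assumption: exact string matching, no normalization, no FILTER check.
shared_variant_counts() {
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }                            # skip header lines
    {
      key = $1 OFS $2 OFS $4 OFS $5
      if (!((key, FILENAME) in seen)) {      # count each file at most once
        seen[key, FILENAME] = 1
        count[key]++
      }
    }
    END { for (key in count) print count[key], key }
  ' "$@"
}

# Hypothetical usage: variants present in at least 2 of the files
#   shared_variant_counts /home/me/test/*.mutect.sorted.vcf | awk '$1 >= 2'
```

    The output order is arbitrary, so pipe it through `sort` if a stable listing is needed.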

  • ronton, USA (Member)
    edited October 2014

    The .vcf file entries contain SOMATIC;VT=SNP, but I do not see VT at the beginning. Would that help? From one of the mutect.sorted.vcf files:

    ##FORMAT=<ID=AD,Number=2,Type=Integer,Description="# of reads supporting consensus reference/indel at the site">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total coverage at the site">
    ##FORMAT=<ID=MM,Number=2,Type=Float,Description="Average # of mismatches per ref-/consensus indel-supporting read">
    ##FORMAT=<ID=MQS,Number=2,Type=Float,Description="Average mapping qualities of ref-/consensus indel-supporting reads">
    ##FORMAT=<ID=NQSBQ,Number=2,Type=Float,Description="Within NQS window: average quality of bases in ref-/consensus indel-supporting reads">
    ##FORMAT=<ID=NQSMM,Number=2,Type=Float,Description="Within NQS window: fraction of mismatching bases in ref/consensus indel-supporting reads">
    ##FORMAT=<ID=REnd,Number=2,Type=Integer,Description="Median/mad of indel offsets from the ends of the reads">
    ##FORMAT=<ID=RStart,Number=2,Type=Integer,Description="Median/mad of indel offsets from the starts of the reads">
    ##FORMAT=<ID=SC,Number=4,Type=Integer,Description="Strandness: counts of forward-/reverse-aligned reference and indel-supporting reads (FwdRef,RevRef,FwdIndel,RevIndel)">
    ##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
    ##SomaticIndelDetector="analysis_type=SomaticIndelDetector input_file=[NormalBAM.bam, TumorBAM.bam] read_buffer_size=null

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @ronton,

    The first problem you ran into is a bug in GATK 3.3; we added a check for samples that have the same name in CombineVariants that has the unintended consequence of forcing you to specify a merge option.

    The second problem seems to be a bug in MuTect, where the VT field definition is not added to the VCF header. You can edit the VCF manually to add a definition line to the header, which will resolve the issue so you don't have to wait for a fix.
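
    With several files to fix, the manual edit can be scripted. A minimal sketch follows, which inserts an INFO definition for VT just before the `#CHROM` line unless one is already present. The function name `add_vt_header` is hypothetical, and the Number, Type and Description values are assumptions based on entries such as `VT=SNP`, so adjust them if your files differ.

```shell
# Sketch: print a VCF with a VT INFO definition added to its header.
# Assumption: VT holds a single string value (e.g. VT=SNP); the
# Description text is illustrative only.
add_vt_header() {
  awk '
    /^##INFO=<ID=VT,/ { have_vt = 1 }        # definition already present
    /^#CHROM/ && !have_vt {                  # insert just before column header
      print "##INFO=<ID=VT,Number=1,Type=String,Description=\"Variant type, can be SNP, INS or DEL\">"
    }
    { print }
  ' "$1"
}

# Hypothetical usage (output file name is a placeholder):
#   add_vt_header 1.mutect.sorted.vcf > 1.mutect.fixed.vcf
```

    Writing to a new file rather than editing in place keeps the original MuTect output untouched in case the header needs a different definition.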
