CatVariants or CombineVariants

jacobhsu (Hong Kong; Member)

I want to merge several VCF files that I produced by running HaplotypeCaller with the -L argument, one chromosome at a time, on the same list of samples. In other words, the samples are identical; I only used -L to call variants chromosome by chromosome. I assume CatVariants and CombineVariants will give me the same results, right?

Comments

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Yes, they will give you the same results; only the headers will be slightly different, because CatVariants just uses the header of the first VCF file, while CombineVariants generates a new header. CatVariants is faster.

  • jacobhsu (Hong Kong; Member)

    Thanks for the quick reply. Here is the original command I tried:

    ${java7} -Xmx2g -jar $GATK/GenomeAnalysisTK.jar \
    -R $reference_genome \
    -T CombineVariants \
    --variant $inputdir/chr01.vcf \
    --variant $inputdir/chr02.vcf \
    --variant $inputdir/chr03.vcf \
    --variant $inputdir/chr04.vcf \
    -o $vcf_output \
    -nt 8 \
    -genotypeMergeOptions UNIQUIFY

    Will that be different from this?

    ${java7} -cp $GATK/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
    -R $reference_genome \
    -V $inputdir/chr01.vcf \
    -V $inputdir/chr02.vcf \
    -V $inputdir/chr03.vcf \
    -V $inputdir/chr04.vcf \
    -out $vcf_output \
    -assumeSorted

    As you mentioned, CombineVariants generates a new header, and its output file is much larger than the CatVariants output, but the content is the same. Could you please explain further?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    The results, in terms of how variant records are combined, should be essentially the same if the files you are merging involve all the same samples, just different chromosomes. What differences are you observing exactly? Can you maybe post the headers?
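A minimal sketch of one way to see what actually differs between the two outputs, assuming plain-text (uncompressed) VCFs. The function names are hypothetical helpers, not part of GATK; they only split each file into header lines and variant records and compare the two parts separately.

```shell
# Hypothetical helpers (not GATK tools); they expect uncompressed VCF files.
vcf_header()  { awk '/^#/'  "$1"; }   # meta-information lines plus the #CHROM line
vcf_records() { awk '!/^#/' "$1"; }   # variant records only

vcf_compare() {
    # Report separately whether the headers and the records of two VCFs match.
    if [ "$(vcf_header "$1")" = "$(vcf_header "$2")" ]; then
        echo "headers: identical"
    else
        echo "headers: differ"
    fi
    if [ "$(vcf_records "$1")" = "$(vcf_records "$2")" ]; then
        echo "records: identical"
    else
        echo "records: differ"
    fi
}
```

Calling `vcf_compare` on the CatVariants output and the CombineVariants output prints one line for the headers and one for the records. The comparison holds each file in memory, so for very large VCFs it may be better to write the `vcf_records` output to temporary files and compare those with `cmp` or `diff`.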

  • mglclinical (USA; Member)

    Hi @Geraldine_VdAuwera ,

    I am using GATK 3.5.

    I am running HaplotypeCaller (in GVCF mode) per chromosome with the -L argument, which is similar to what @jacobhsu is doing.

    I know that CatVariants can be used to concatenate .vcf files for a given sample.

    I would like to confirm that CatVariants can also be used to concatenate .g.vcf files.

    I am asking because the CatVariants documentation here makes no mention of .g.vcf files.

    Thanks,
    mglclinical

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @mglclinical, yes, CatVariants works on GVCFs, as they are valid VCFs.

  • bioinfo_89 (India; Member)

    Hi @Geraldine_VdAuwera ,

    I am using GATK 3.8. I am calling variants on amplicon sequencing data using the -L option, and I have filtered SNPs and INDELs separately into two VCFs. I want to merge the SNPs and INDELs into a single file, and I would like to know whether CatVariants or CombineVariants is more suitable for this.

    The CatVariants documentation says it is meant for "non-overlapping genome intervals". Does that mean it keeps only a single variant when the start position is the same? For example, if I have a SNP and an INDEL at the same position, will it write both to the merged file, or drop one and keep just the other?

    Thanks,
    B89

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    No, it means you cannot use CatVariants to merge files containing variants in the same range of positions. For this case you need to use CombineVariants.
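As a rough illustration of the "same range of positions" condition, the sketch below checks whether two uncompressed VCFs share a contig on which their first-to-last position ranges intersect. The function names are hypothetical and this is not how GATK itself decides anything; it is only a quick pre-check, and it treats each file's range per contig as one block from its lowest to its highest POS.

```shell
# Hypothetical helpers (not GATK tools); they expect uncompressed VCF files.
vcf_ranges() {
    # Print "contig <TAB> lowest POS <TAB> highest POS" for each contig in a VCF.
    awk -F'\t' '!/^#/ {
        pos = $2 + 0
        if (!($1 in lo) || pos < lo[$1]) lo[$1] = pos
        if (!($1 in hi) || pos > hi[$1]) hi[$1] = pos
    }
    END { for (c in lo) printf "%s\t%d\t%d\n", c, lo[c], hi[c] }' "$1" | sort
}

vcfs_overlap() {
    # Exit status 0 if both files have a contig whose position ranges intersect,
    # 1 otherwise. Each range line is tagged A or B so one awk pass can compare them.
    {
        vcf_ranges "$1" | awk -v tag=A '{ print tag "\t" $0 }'
        vcf_ranges "$2" | awk -v tag=B '{ print tag "\t" $0 }'
    } | awk -F'\t' '
        $1 == "A" { lo[$2] = $3 + 0; hi[$2] = $4 + 0; next }
        ($2 in lo) && ($3 + 0) <= hi[$2] && ($4 + 0) >= lo[$2] { found = 1 }
        END { exit (found ? 0 : 1) }'
}
```

Per-chromosome call sets such as chr01.vcf and chr02.vcf would make `vcfs_overlap` return 1 (no overlap), whereas a SNP file and an INDEL file from the same amplicons would normally make it return 0, which is the case where CombineVariants is needed.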

  • SkyWarrior (Turkey; Member ✭✭✭)

    Will CombineVariants be ported to GATK4.0? It is probably one of the three tools that still keep me on GATK3.x
    (DepthOfCoverage and UnifiedGenotyper are the other two).

  • bioinfo_89 (India; Member)

    OK, so I will have to use CombineVariants with the -genotypeMergeOptions UNIQUIFY option to merge the SNP and INDEL files for each sample?

    Thanks!

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @bioinfo_89
    Hi,

    Have a look at this thread. You should not add the extra argument.

    -Sheila

  • bioinfo_89 (India; Member)

    Hi @Sheila ,

    I tried running CombineVariants without -genotypeMergeOptions UNIQUIFY, and it throws the following error:

    ERROR MESSAGE: Duplicate sample names were discovered but no genotypemergeoption was supplied. To combine samples without merging, specify --genotypemergeoption UNIQUIFY. Merging duplicate samples without specified priority is unsupported, but can be achieved by specifying --genotypemergeoption UNSORTED.

    I am not sure whether the output will be what I want if I use the UNSORTED option!

    Any suggestions?

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @bioinfo_89
    Hi,

    I see. Sorry for the confusion. Can you try adding `--assumeIdenticalSamples`?

    Thanks,
    Sheila
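A hedged sketch of what the suggested invocation might look like for GATK 3.x, modeled on the CombineVariants command earlier in the thread. The function name and the file names in the usage note are placeholders, not values from the thread. The function only prints the command line so it can be reviewed before running, and it does not quote paths that contain spaces.

```shell
# Hypothetical helper: print (not run) a GATK 3.x CombineVariants command
# that merges per-sample SNP and INDEL VCFs with --assumeIdenticalSamples.
# Usage: combine_cmd <GenomeAnalysisTK.jar> <reference.fasta> <output.vcf> <input.vcf>...
combine_cmd() {
    jar=$1 ref=$2 out=$3
    shift 3
    printf 'java -jar %s -T CombineVariants -R %s' "$jar" "$ref"
    # One --variant argument per remaining input file.
    for vcf in "$@"; do
        printf ' --variant %s' "$vcf"
    done
    printf ' -o %s --assumeIdenticalSamples\n' "$out"
}
```

For example, `combine_cmd GenomeAnalysisTK.jar ref.fasta merged.vcf snps.vcf indels.vcf` prints a single command line with two `--variant` arguments. Whether `--assumeIdenticalSamples` gives the desired result for SNPs and INDELs at the same position is worth checking on a small test file first.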

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @SkyWarrior
    Hi,

    CombineVariants is supposed to be ported. Have a look at this issue. I don't think the team has any plans to port UnifiedGenotyper, but there seems to be demand in the GATK community, so my team is aware and will see what we can do.
    The coverage tools have also been bumped up in priority to be ported over. Have a look at this ticket.

    -Sheila

  • SkyWarrior (Turkey; Member ✭✭✭)

    That is good to hear. Thanks.
