Dropping Samples from gVCF

GATK Admins,

We are currently working on a project, and after the gVCF merging phase we found that some of our samples were contaminated. Is it possible to remove samples from a merged gVCF (likely using SelectVariants), or would we need to re-merge only the good gVCFs into a new merged gVCF? (Note that we're actually working with a double-merged gVCF file containing ~5,000 samples, so re-merging could be costly.)


John Wallace

Best Answer


  • @Sheila,

    We currently have >40,000 samples, which requires merging gVCFs twice (typically in sets of ~150, then merging ~30 combined gVCFs). So, this means that we will have to re-merge and re-merge again for any gVCF that contains a sample that we wish to delete?



  • Sheila (Broad Institute; Member, Broadie admin)

    Hi John,

    Yes, you are correct. I am hoping you only have a few combined GVCFs that contain contaminated samples!

    Good luck.


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    To elaborate a little on Sheila's answer: you're better off re-merging from clean samples because the site-level annotation values of each record get averaged during merging, and SelectVariants may not be able to recalculate all annotations appropriately when you subset to eliminate contaminated samples.
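
    A toy illustration of why subsetting cannot simply fix an averaged value. It assumes, purely for illustration, that a site-level annotation is stored as a plain mean over samples; GATK's real annotations are computed in their own ways, and the per-sample numbers and the helper name `mean_after_dropping` are made up here.

```python
from statistics import mean


def mean_after_dropping(old_mean, n_samples, dropped_values):
    """Exact mean of the remaining samples.

    Needs the dropped samples' own values; the stored mean alone is not enough.
    """
    remaining = n_samples - len(dropped_values)
    if remaining <= 0:
        raise ValueError("cannot drop every sample")
    return (old_mean * n_samples - sum(dropped_values)) / remaining


# Hypothetical per-sample values at one site (invented for this sketch).
per_sample = [30.0, 28.0, 31.0, 5.0]

# What a merged record would keep if only the average were stored.
merged_value = mean(per_sample)

# Subsetting without the per-sample values leaves the stored mean as it was.
stale_value = merged_value

# With the dropped sample's value in hand, the mean can be corrected exactly.
corrected = mean_after_dropping(merged_value, len(per_sample), [5.0])
```

    The point of the sketch is only that the correction needs information a merged record may no longer carry, which is why re-merging from the clean per-sample gVCFs is the safe route.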

    However, considering the non-trivial size of your cohort it may be worth your while to run some small-scale tests to see what would be the tradeoff if you decide to take the SelectVariants shortcut. IIRC the few site-level annotations emitted to GVCF get recalculated by GenotypeGVCFs anyway. Also, if you need to take 5 samples out of 40,000, chances are that the impact of those individual samples on the site stats is going to be marginal to nonexistent (unless they happen to include low-frequency variants that are private to just that subset -- which would be a massive case of dumb rotten luck). So I would recommend randomly choosing a small number of intervals (that contain both reference blocks and variant calls), running both scenarios, and evaluating what changes happen if any. And please let us know if you observe anything interesting!
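
    A minimal sketch of how such a small-scale test could be set up, not a definitive workflow. It assumes you supply contig lengths yourself and that both scenarios (re-merge from clean gVCFs vs. the SelectVariants shortcut) have already been genotyped over the same intervals into two uncompressed VCFs. All function names are hypothetical; intervals are written as `contig:start-end` strings, the form accepted by GATK's `-L` argument.

```python
import random


def pick_intervals(contig_lengths, n, width, seed=0):
    """Pick n random fixed-width intervals, weighting contigs by length."""
    rng = random.Random(seed)
    contigs = [c for c, length in contig_lengths.items() if length >= width]
    weights = [contig_lengths[c] for c in contigs]
    intervals = []
    for _ in range(n):
        contig = rng.choices(contigs, weights=weights, k=1)[0]
        start = rng.randint(1, contig_lengths[contig] - width + 1)
        intervals.append(f"{contig}:{start}-{start + width - 1}")
    return intervals


def parse_info(info):
    """Turn a VCF INFO string into a dict; flag entries map to True."""
    out = {}
    if info == ".":
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def read_sites(lines):
    """Map (CHROM, POS, REF, ALT) to the INFO dict of each record line."""
    sites = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        sites[key] = parse_info(fields[7])
    return sites


def compare_sites(a, b):
    """Return sites unique to each call set and INFO values that differ."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {}
    for site in set(a) & set(b):
        keys = set(a[site]) | set(b[site])
        diff = {k: (a[site].get(k), b[site].get(k))
                for k in keys if a[site].get(k) != b[site].get(k)}
        if diff:
            changed[site] = diff
    return only_a, only_b, changed
```

    In use, the strings from `pick_intervals` would go into an intervals list for both runs, and each output VCF would be passed to `read_sites` as an open file handle. Values are compared as strings, so any difference at all is reported; a numeric tolerance may be more useful in practice.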

  • tommycarstensen (United Kingdom; Member)

    @Geraldine_VdAuwera said:
    And please let us know if you observe anything interesting!

    @johnwallace123 Just out of curiosity, did you find anything interesting?
