Dropping Samples from gVCF

GATK Admins,

We are currently working on a project and found, after the gVCF merging phase, that some of our samples were contaminated. Is it possible to remove samples from a merged gVCF (likely using SelectVariants), or would we need to re-merge only the good gVCFs into a new merged gVCF? (Note that we're actually working with a double-merged gVCF file containing ~5,000 samples, so re-merging could be costly.)

Thanks,

John Wallace

Best Answer

Answers

  • @Sheila,

    We currently have >40,000 samples, which requires merging gVCFs twice (typically in sets of ~150, then merging ~30 combined gVCFs). So does this mean that we will have to re-merge, and then re-merge again, any gVCF that contains a sample we wish to delete?

    Thanks,

    John

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @johnwallace123
    Hi John,

    Yes, you are correct. I am hoping you only have a few combined GVCFs that contain contaminated samples!

    Good luck.

    -Sheila
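As a rough illustration of the two-level re-merge confirmed above, the sketch below works out which first-level batches and which second-level merges contain a contaminated sample and so would need rebuilding. The function name `plan_remerge`, the mapping names, and the toy sample and batch labels are all hypothetical; nothing here calls GATK.

```python
def plan_remerge(sample_to_batch, batch_to_merge, contaminated):
    """Return (first-level batches, second-level merges) that must be rebuilt.

    sample_to_batch: sample name -> first-level combined gVCF it went into
    batch_to_merge:  first-level combined gVCF -> second-level merged gVCF
    contaminated:    iterable of sample names to drop
    """
    batches = {sample_to_batch[s] for s in contaminated}
    merges = {batch_to_merge[b] for b in batches}
    return sorted(batches), sorted(merges)


# Toy layout (hypothetical): 2 second-level merges, each built from
# 3 first-level batches of 4 samples.
sample_to_batch = {}
batch_to_merge = {}
for m in range(2):
    for b in range(3):
        batch = f"batch_{m}_{b}"
        batch_to_merge[batch] = f"merge_{m}"
        for s in range(4):
            sample_to_batch[f"S{m}{b}{s}"] = batch

batches, merges = plan_remerge(sample_to_batch, batch_to_merge,
                               ["S010", "S012", "S120"])
print("first-level batches to rebuild:", batches)
print("second-level merges to rebuild:", merges)
```

Only the affected batches are rebuilt from their clean per-sample gVCFs; every other first-level combined gVCF can be reused as-is when the second-level merge is redone.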

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    To elaborate a little on Sheila's answer, the reason you're better off re-merging from clean samples is that the site-level annotation values of each record get averaged during merging, and SelectVariants may not be able to recalculate all annotations appropriately when you subset to eliminate contaminated samples.

    However, considering the non-trivial size of your cohort it may be worth your while to run some small-scale tests to see what would be the tradeoff if you decide to take the SelectVariants shortcut. IIRC the few site-level annotations emitted to GVCF get recalculated by GenotypeGVCFs anyway. Also, if you need to take 5 samples out of 40,000, chances are that the impact of those individual samples on the site stats is going to be marginal to nonexistent (unless they happen to include low-frequency variants that are private to just that subset -- which would be a massive case of dumb rotten luck). So I would recommend randomly choosing a small number of intervals (that contain both reference blocks and variant calls), running both scenarios, and evaluating what changes happen if any. And please let us know if you observe anything interesting!
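The small-scale test suggested above could be set up along these lines. This is a minimal sketch, not a GATK tool: `pick_intervals` draws random fixed-width intervals in the usual `contig:start-end` form, and `diff_annotations` compares chosen INFO annotations between the output of the two scenarios (re-merged from clean samples versus subset with SelectVariants), given as lists of VCF text lines. The function names, the contig lengths, the tolerance, and the choice of annotation keys are assumptions for illustration.

```python
import random


def pick_intervals(contig_lengths, n, width, seed=0):
    """Pick n random fixed-width intervals, weighting contigs by length."""
    rng = random.Random(seed)
    contigs = [c for c, length in contig_lengths.items() if length >= width]
    weights = [contig_lengths[c] for c in contigs]
    intervals = []
    for _ in range(n):
        contig = rng.choices(contigs, weights=weights)[0]
        start = rng.randint(1, contig_lengths[contig] - width + 1)
        intervals.append(f"{contig}:{start}-{start + width - 1}")
    return intervals


def parse_info(vcf_lines):
    """Map (CHROM, POS, REF, ALT) -> dict of INFO key/value strings."""
    sites = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        info = {}
        for item in f[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        sites[(f[0], int(f[1]), f[3], f[4])] = info
    return sites


def diff_annotations(remerged, subset, keys, tol=1e-6):
    """List (site, key, remerged_value, subset_value) where the outputs differ."""
    a, b = parse_info(remerged), parse_info(subset)
    diffs = []
    for site in sorted(set(a) | set(b)):
        for key in keys:
            va = a.get(site, {}).get(key)
            vb = b.get(site, {}).get(key)
            try:
                same = abs(float(va) - float(vb)) <= tol
            except (TypeError, ValueError):
                # Missing or non-numeric values: fall back to exact comparison.
                same = va == vb
            if not same:
                diffs.append((site, key, va, vb))
    return diffs


# Placeholder contig lengths; substitute the values from your reference.
test_intervals = pick_intervals({"chr20": 64_000_000, "chr21": 46_000_000},
                                n=5, width=100_000, seed=42)
print(test_intervals)
```

Both scenarios would be run over the same intervals, and sites that appear in only one output show up in the diff with `None` on the missing side, so private low-frequency variants of the removed samples are easy to spot.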

  • tommycarstensen (United Kingdom; Member)

    @Geraldine_VdAuwera said:
    And please let us know if you observe anything interesting!

    @johnwallace123 Just out of curiosity, did you find anything interesting? :smile: