
Filter samples of bad quality before running GermlineCNVCaller

Do you filter out low-quality samples (e.g. those with high variability in read counts) before constructing the model in GermlineCNVCaller cohort mode, as is common with other CNV calling methods? Which metrics would you recommend for identifying low-quality samples? Or are such samples automatically leveled out if the cohort is large enough?

I often observe a few samples accounting for a large proportion of all called copy-number variations (mainly false positives, I guess), and I wonder whether I can filter out these samples beforehand.


  • SkyWarrior (Turkey; Member)

    It may be better to check the mean and median coverage of the samples before generating a cohort. The greater the variability between samples, the more false positives and false negatives you get.
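As a rough illustration of the coverage check suggested above, here is a minimal Python sketch. It assumes you already have per-interval read counts for each sample in memory (for example, parsed from the output of a read-count collection step). The function names, the toy data, and the robust z-score cutoff of 3.5 are all assumptions made for illustration and are not GATK recommendations.

```python
from statistics import mean, median, pstdev

def coverage_summary(counts):
    """Mean, median and coefficient of variation of per-interval read counts."""
    m = mean(counts)
    cv = pstdev(counts) / m if m else float("inf")
    return {"mean": m, "median": median(counts), "cv": cv}

def flag_outliers(samples, max_robust_z=3.5):
    """Flag samples whose median coverage lies far from the cohort median.

    Uses a robust z-score based on the median absolute deviation (MAD).
    The cutoff is a hypothetical default, not a validated threshold.
    """
    summaries = {name: coverage_summary(c) for name, c in samples.items()}
    medians = [s["median"] for s in summaries.values()]
    cohort_median = median(medians)
    mad = median([abs(x - cohort_median) for x in medians])
    flagged = []
    for name, s in summaries.items():
        deviation = abs(s["median"] - cohort_median)
        if mad == 0:
            z = 0.0 if deviation == 0 else float("inf")
        else:
            z = 0.6745 * deviation / mad
        if z > max_robust_z:
            flagged.append(name)
    return summaries, flagged

# Toy per-interval counts; sample names and values are made up.
samples = {
    "s1": [100, 110, 90, 105],
    "s2": [95, 100, 105, 98],
    "s3": [102, 99, 101, 97],
    "s4": [20, 300, 15, 25],
    "s5": [104, 96, 100, 103],
}
summaries, flagged = flag_outliers(samples)
print(flagged)
```

The per-sample coefficient of variation in `summaries` can be inspected the same way if noisy read counts, rather than overall depth, are the concern.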

  • asmirnov (Broad; Member, Broadie, Dev)

    @cruckert The GATK gCNV model is capable of handling outlying samples and cohorts with batch effects. That said, you can run into problems if you use too few samples in a highly heterogeneous cohort.

    We deal with this by pre-clustering samples based on their coverage and choosing a subset of samples from each discovered cluster to run in COHORT mode (see https://software.broadinstitute.org/gatk/documentation/article?id=11684). This ensures that each model captures batch effects from a single batch, which makes gCNV's job a lot easier. We also use ~200 samples for training each model in COHORT mode.
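The pre-clustering idea described above could be sketched roughly as follows. This is not the procedure used by the GATK team, which the thread does not spell out. It is a small stand-in that uses plain k-means on depth-normalized coverage profiles and then draws up to a fixed number of samples per cluster. All function names, the choice of k-means, the number of clusters, and the toy data are assumptions; only the figure of roughly 200 training samples per model comes from the reply.

```python
import random

def normalize(counts):
    """Scale per-interval counts to sum to 1, removing overall depth differences."""
    total = sum(counts)
    return [c / total for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two coverage profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, iterations=20):
    """Tiny k-means over {sample: profile}; returns {sample: cluster index}."""
    names = sorted(profiles)
    # Deterministic farthest-point initialisation: start from the first sample,
    # then repeatedly add the sample farthest from the centres chosen so far.
    centres = [profiles[names[0]]]
    while len(centres) < k:
        far = max(names, key=lambda n: min(sq_dist(profiles[n], c) for c in centres))
        centres.append(profiles[far])
    labels = {}
    for _ in range(iterations):
        # Assign every sample to its nearest centre.
        labels = {
            n: min(range(k), key=lambda i: sq_dist(profiles[n], centres[i]))
            for n in names
        }
        # Move each centre to the mean profile of its members.
        for i in range(k):
            members = [profiles[n] for n in names if labels[n] == i]
            if members:
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def pick_training_sets(labels, max_per_cluster=200, seed=0):
    """Draw up to max_per_cluster samples from each cluster for COHORT-mode training."""
    rng = random.Random(seed)
    clusters = {}
    for name, label in labels.items():
        clusters.setdefault(label, []).append(name)
    return {
        label: sorted(rng.sample(sorted(members), min(max_per_cluster, len(members))))
        for label, members in clusters.items()
    }

# Toy per-interval counts for two made-up batches with different coverage profiles.
counts = {
    "a1": [100, 200, 100, 200],
    "a2": [110, 190, 105, 195],
    "a3": [50, 100, 52, 98],
    "b1": [200, 100, 200, 100],
    "b2": [190, 110, 205, 95],
}
profiles = {name: normalize(c) for name, c in counts.items()}
labels = kmeans(profiles, k=2)
training = pick_training_sets(labels, max_per_cluster=2)
print(training)
```

Each entry in `training` would then be one candidate cohort for a separate COHORT-mode run. With real data, the number of clusters and the clustering method itself would need to be chosen by inspecting the coverage profiles.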
