We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Contamination estimation while following the Best Practices workflow


I've started following the Best Practices guide to process >80 whole-genome samples (reference-confidence model workflow). During variant calling via HaplotypeCaller, a few of the gVCFs blew up in size (> 1TB, while the rest of gVCFs were around 75 GB). I suspect that these samples are contaminated. I'd like to:

1) Confirm that the samples that produced large gVCFs are contaminated
2) Check that the rest of the samples are unaffected (higher priority)

I've started looking at GATK-friendly ConTest [1]. It seems that it requires genotyped VCF files as input. Is it appropriate to call GenotypeGVCFs individually for each gVCF and use the produced VCF as input for ConTest? Or can I use "raw" gVCFs?

I've also considered using cleanCall [2], but sounds like I'd need to repeat variant calling using samtools (took ages to produce gVCFs, so trying avoid this) or somehow do the gVCF -> VCF -> PED conversion.

Thank you!

[1] https://www.broadinstitute.org/cancer/cga/contest_run
[2] https://github.com/hyunminkang/cleanCall


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2016

    Hi @Alesha

    Wilth ContEst you should be able to not only detect contamination but also estimate the extent (%) of contamination and the 95% confidence interval of this estimate for your samples. It is a highly sensitive tool.

    ContEst can genotype on the fly from your BAM as shown in the following example command:

    When interpreting the command, keep in mind ConTest was originally designed for use with tumor and normal paired samples and the genotype file was an array genotype file obtained independently so folks could also detect sample swaps. I have in my notes that on-the-fly genotyping calls any site with >80% bases showing ALT and requires at least 50x coverage homozygous sites. I'm not aware of updates to the tool in this regard so if your coverage is lower and the on-the-fly genotyping errors, then you'll have to resort to providing a genotype VCF that you derive elsewhere, e.g. your per sample GVCFs. GVCFs are valid VCFs.

    As you imply from your question, genotype calls can change in the GVCF to multi-sample VCF conversion via GenotypeGVCFs given the multiple samples allow for more accurate genotype likelihoods. Unless you expect many such changes to genotype calls for your samples, for the purposes of estimating contamination, I think it's safe to use the GVCF.

    You will need a population allele frequency file to use ContEst. This provides population variant sites from which ContEst looks for the intersection with your sample's variant sites. Of these, ContEst uses sites in the genotype VCF that are homozygous variant. ContEst then considers the subset of these in the sample BAM that are non-hom-var in calculating the contamination fraction. The population allele frequency file enables calculation of probabilities of contamination from common SNPs. Given all this, the tool should altogether ignore the <NON_REF> blocks with their hom-ref genotypes (0/0) in the GVCFs. If this is not the case and you see an error, then please let us know.

    I cannot comment on outside tools such as cleanCall but hopefully the information above is helpful towards your aims. I'm a bit curious as to how a GVCF can blow up more than 10-fold in size and I hope you can sort this all out.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    You may want to have a look at this thread as well.


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks Sheila. To quote Sheila's linked thread:

    [T]he size of GVCF files depends largely on how homogeneous the data quality is. If there is a lot of variability in the amount of data present from region to region, or site to site, you will end up with very fragmented reference blocks, which means more records in the file, and therefore a larger file size. This is more likely to happen in files with less coverage....

    So @Alesha, given the above, you should also consider variation in the evenness of coverage for the genomic intervals you use to call variants as a reason for inflated file size.

Sign In or Register to comment.