Bug in GATK exome pipeline - VerifyBamID step

Hi, in the CheckContamination task of the gatk-workflows exome pipeline, VerifyBamID sometimes fails with the error below, and the expected .selfSM output file is never written. This may be because my exome inputs are a bit smaller than what the Broad usually produces.
NOTICE - Process chr22:50683836-50683836...
NOTICE - Process chr22:50694325-50694325...
NOTICE - Process chr22:50745507-50745507...
NOTICE - Process chr22:50774185-50774185...
NOTICE - Number of marker in Reference Matrix:99976
NOTICE - Number of marker shared with input file:9565
NOTICE - Mean Depth:26.386618
NOTICE - SD Depth:43.449975
NOTICE - 9364 SNP markers remained after sanity check.

WARNING -
Insufficient Available markers, check input bam depth distribution in output pileup file after specifying --OutputPileup
2019/08/31 09:54:55 Starting delocalization.
2019/08/31 09:54:57 Delocalizing output /cromwell_root/UDP-1103.exome.preBqsr.selfSM -> gs://fc-secure-da2df7c1-77ef-4ed0-95ab-20cbed757a2a/95aed782-1446-4314-bb03-e9388219d197/ExomeGermlineSingleSample/b5e52cca-926e-44c1-8ddc-d55571b087ab/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/b6b3104e-8119-4636-8f04-7c601578ddf0/call-CheckContamination/UDP-1103.exome.preBqsr.selfSM
Required file output '/cromwell_root/UDP-1103.exome.preBqsr.selfSM' does not exist.

Answers

  • bshifaw Member, Broadie, Moderator admin

    @GER Thanks for reporting this.

    Here is what I found out from my team:

    “It [VerifyBamID] reports warning if the number of polymorphic markers are less than 1,000 or less than 10% of provided marker. I guess you provided genome-wide marker data but the sequence data is exome (is that right?). It is better to create exome-only VCF as your reference for more accurate inference.”
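To see how the quoted rule applies to the log above, here is a minimal arithmetic sketch. It assumes that "provided marker" means the 99976 markers in the reference matrix and that the count being compared is the 9364 markers left after the sanity check; the function name and parameters are hypothetical and are not taken from the VerifyBamID source.

```python
def insufficient_markers(usable, provided, min_count=1000, min_fraction=0.10):
    """Return True if the quoted warning condition would be met.

    Hypothetical helper: mirrors the rule as described in the quote
    (fewer than 1,000 markers, or fewer than 10% of the provided markers).
    """
    return usable < min_count or usable < min_fraction * provided


# Numbers taken from the log in the question.
provided = 99976  # "Number of marker in Reference Matrix"
usable = 9364     # "SNP markers remained after sanity check"

print(f"fraction of provided markers usable: {usable / provided:.4f}")
print("warning expected:", insufficient_markers(usable, provided))
```

Under these assumptions the sample is well above 1,000 markers but falls just under the 10% line, which would be consistent with a genome-wide marker set being used against exome data.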

    The marker data in question are the contamination_sites_ud, contamination_sites_mu, and contamination_sites_bed inputs.

    They suggested a possible workaround: use bedtools intersect to remove the markers that fall outside the exome, then rerun the pipeline to see if it works.
    I'll inform the pipeline developers of the issue.
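The intersect step in the workaround can be sketched in plain Python as below. This is an illustration of the idea only, not a replacement for bedtools: it assumes markers and exome targets are given as (chrom, start, end) tuples in half-open BED coordinates, and both function names are hypothetical. It returns row indices instead of rows, on the assumption that the ud, mu, and bed marker files list markers in the same row order and would all need to be subset to the same rows.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_targets(targets):
    """Group (chrom, start, end) intervals by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def overlapping_marker_rows(markers, targets):
    """Return indices of markers that overlap at least one target interval."""
    merged = merge_targets(targets)
    kept = []
    for i, (chrom, start, end) in enumerate(markers):
        intervals = merged.get(chrom, [])
        # Find the last merged interval that starts before the marker ends.
        j = bisect_right(intervals, (end - 1, float("inf"))) - 1
        if j >= 0 and intervals[j][1] > start:
            kept.append(i)
    return kept


# Toy data, not real marker or target coordinates.
targets = [("chr22", 100, 200), ("chr22", 150, 300), ("chr1", 10, 20)]
markers = [("chr22", 99, 100), ("chr22", 100, 101), ("chr22", 299, 300),
           ("chr22", 300, 301), ("chr1", 15, 16), ("chr2", 15, 16)]
print(overlapping_marker_rows(markers, targets))
```

The kept indices could then be used to write matching subsets of all three marker files, so that they stay consistent with each other before being passed to the pipeline.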
