GATK resource bundles scattered_calling_intervals exclude small contigs

Hi there,

I was just going over some HaplotypeCaller and VQSR results generated using your best practices Cromwell workflows, and found that the scattered_calling_intervals files you provide (which those workflows use to split up the calling) do not cover the whole genome. For hg38, chrM and all of the alt/unplaced contigs are excluded. For b37, chrY is also excluded.

https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/

This seems like a fairly major bug that would cause people running your best practices to lose a good number of potentially important variants.
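
For anyone who wants to check this on their own copy of the bundle, here is a minimal sketch of how the gap can be detected. It assumes the files are standard Picard interval lists (a SAM-style header with `@SQ` lines, followed by tab-separated `contig  start  end  strand  name` records). The function names are my own and are not part of GATK or Picard.

```python
def parse_interval_list(text):
    """Parse one Picard interval list given as a string.

    Returns (dictionary_contigs, covered_contigs): the contig names declared
    in the @SQ header lines, in order, and the set of contigs that have at
    least one interval record in the body.
    """
    dictionary_contigs = []
    covered_contigs = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            # Header line; only @SQ lines carry contig names (SN: field).
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        dictionary_contigs.append(field[3:])
            continue
        # Body record: the first column is the contig name.
        covered_contigs.add(line.split("\t")[0])
    return dictionary_contigs, covered_contigs


def uncovered_contigs(texts):
    """Return contigs declared in any header but absent from every body."""
    declared = {}  # dict used as an ordered set
    covered = set()
    for text in texts:
        contigs, hit = parse_interval_list(text)
        for contig in contigs:
            declared.setdefault(contig, None)
        covered |= hit
    return [contig for contig in declared if contig not in covered]
```

Passing the contents of all the scattered interval list files to `uncovered_contigs` should list every contig in the sequence dictionary that no shard touches.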

Answers

  • Thank you for your reply, @AdelaideR

    It would definitely be helpful if that document were easy to find (for example, linked from the bundle page) and more explicit. The implication is that only low-complexity regions such as centromeres are filtered out, which I don't think most people would expect to include genic regions.

    It's also frustrating that your WDL workflow (haplotypecaller-gvcf-gatk4.wdl) wants these as 50 directories, each with its own Picard interval list file containing around 10 intervals, plus a text file listing the path to each of these files (and those paths have to be set up relative to the execution environment).

  • AdelaideR (Member, admin)

    @oneillkza I agree that a readme file of some type would be helpful for the resource bundle. We have bounced around a few ideas about how best to maintain the document among our diverse teams. I will pass along the comment about the WDL workflow to see where adjustments can be made to streamline this process.
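
Until the workflow is streamlined, the text file of paths described in the comment above can be generated rather than written by hand. The following is a rough sketch only. It assumes the shards sit one per subdirectory under a common root and end in `.interval_list`, and that the workflow accepts a plain text file with one path per line. The function name, the glob pattern, and the choice of absolute local paths are illustrative assumptions, not something taken from the WDL itself.

```python
from pathlib import Path


def write_intervals_list(scatter_root, out_file, pattern="*/*.interval_list"):
    """Write one interval-list path per line to out_file.

    scatter_root: directory holding one subdirectory per shard (assumed layout).
    out_file:     text file to create, one absolute path per line.
    pattern:      glob relative to scatter_root (hypothetical default).
    Returns the sorted list of paths that were written.
    """
    root = Path(scatter_root)
    # Resolve to absolute paths so the list does not depend on the
    # working directory of whatever reads it later.
    paths = sorted(p.resolve() for p in root.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} under {root}")
    Path(out_file).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For a cloud backend the lines would need to be bucket URIs rather than local paths, so the resolve step would have to be swapped for whatever prefix the execution environment expects.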