
How to consolidate 81 GVCF files for 35,000 intervals?

Our aim is to mine exome capture DNA sequencing data, generated for 81 provenances of tropical pine tree species (5-8 trees were pooled per provenance), for informative SNPs. DNA fragments were captured by 35,000 probes. We mapped the exome capture data against the full, but highly fragmented Pinus taeda 2.0 genome (22GB; consisting of 1.76 million scaffolds).

We have per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing. As suggested in the GATK Best Practices “Germline short variant discovery (SNPs + Indels)” workflow, we called variants per sample in order to produce per-sample files in GVCF format. We currently assume a diploid model despite working with pooled samples.
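
For context, our per-sample calling step looks roughly like the following sketch (the reference and file names are placeholders, not our actual paths):

```
# Illustrative per-sample HaplotypeCaller run in GVCF mode; reference, BAM,
# and output names are placeholders.
# --sample-ploidy defaults to 2 (diploid); for pooled samples it could in
# principle be raised, but we currently keep the default as noted above.
gatk HaplotypeCaller \
    -R Pita_2.0.fasta \
    -I provenance01.bam \
    -O provenance01.g.vcf.gz \
    -ERC GVCF \
    -L capture_probe_regions.bed
```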

The bottleneck is consolidating the GVCFs from multiple samples into a GenomicsDB datastore. We use the latest version of GATK (4.0.12.0), which has multi-interval support, and perform the analysis on a server with 3 TB of memory and 96 CPU cores, using the following command:

```
gatk --java-options "-Xmx2500G" GenomicsDBImport \
    -V f1.vcf -V f2.vcf -V f3.vcf -V f4.vcf -V f5.vcf -V f6.vcf \
    -V f7.vcf -V f8.vcf -V f9.vcf -V f10.vcf -V f11.vcf \
    --genomicsdb-workspace-path outputDB \
    -L capture_probe_regions.bed
```

Each capture probe region is roughly 800 bp, mostly on different scaffolds. I performed a few test runs with between 2 and 10 intervals: it takes ~2.5 hours per interval for all 81 files (which is what we would ultimately like to import) and ~20 minutes per interval for 11 files (those belonging to one sub-species). At ~2.5 hours per interval, 35,000 intervals would amount to roughly 87,500 hours (about ten years) of serial runtime, so performing this step as-is is impractical.
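
One workaround we have been considering (only a sketch, untested; the chunk size, memory setting, and the cohort.sample_map file listing all 81 GVCFs are hypothetical) is to split the probe BED into chunks with GNU split and run one GenomicsDBImport per chunk into its own workspace, in parallel:

```
# Split the 35,000 probe regions into chunks of ~500 intervals each
# (chunk size and file names are illustrative).
split -l 500 -d --additional-suffix=.bed capture_probe_regions.bed probes_chunk_

# One import per chunk into its own workspace. cohort.sample_map would be a
# tab-separated "sample<TAB>path/to/gvcf" listing of all 81 GVCFs. All chunks
# are launched at once here for brevity; in practice concurrency would be
# capped to fit the available cores and memory.
for chunk in probes_chunk_*.bed; do
    gatk --java-options "-Xmx32G" GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --genomicsdb-workspace-path "db_${chunk%.bed}" \
        -L "$chunk" &
done
wait
```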

Any advice will be appreciated. Thank you in advance.

Nanette

Answers

  • bshifaw (Member, Broadie, Moderator)

    For a large number of samples, we recommend GenomicsDBImport; the other option would be CombineGVCFs, but that is suggested for a smaller number of GVCFs and would take longer.

    It might help to go over this GenomicsDBImport document to make sure there aren't any additional changes that could be made to your data processing. The document also mentions batch options for very large numbers of samples, which are described in this forum post.
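
    For reference, a batched import using a sample map (the map file name, batch size, and thread count below are illustrative values, not recommendations) might look roughly like this:

```
# cohort.sample_map is a plain tab-separated file with one line per sample:
#   sampleName<TAB>/path/to/sample.g.vcf.gz
gatk --java-options "-Xmx64G" GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --batch-size 20 \
    --reader-threads 4 \
    --genomicsdb-workspace-path outputDB \
    -L capture_probe_regions.bed
```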
