Running joint-discovery-gatk4-local.wdl on hg19
Quoting from the 'About "Ask the team"' thread, since the "ask a question" button is working again:
(Posting this here, since per the above posts, the "ask a question" button is disabled. Please feel free to move this to a thread.)
I'm trying to run the joint-discovery-gatk4-local.wdl on data aligned to hg19. You have provided example input json files, but only for the hg38 case. I'm in the process of generating the (many!) inputs it needs, but had a few questions:
Many of the files needed are supplied in the GATK bundle ftp site. However, the centre I'm at has banned regular ftp (we can only use sftp), and there are quite a few files to download. Is there an easy way to get the contents of ftp://ftp.broadinstitute.org/bundle/hg19/ in a single file? My alternatives are to download the files one by one via the web browser, or to write a script using wget to scrape them.
The input file lists a number of resource files, all of which have obvious corresponding files available in ftp://ftp.broadinstitute.org/bundle/hg38/. However, it looks like there's been a lot of consolidation of files between hg19 and hg38, and it's not entirely clear which ones to use (e.g. there are two different dbSNP files, two different hapmap files, etc). Is there a table somewhere documenting which of these are the best practices to include?
    "##_COMMENT4": "RESOURCE FILES",
    "JointGenotyping.dbsnp_vcf": "/home/bshifaw/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
    "JointGenotyping.dbsnp_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
    "JointGenotyping.one_thousand_genomes_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
    "JointGenotyping.one_thousand_genomes_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi",
    "JointGenotyping.omni_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz",
    "JointGenotyping.omni_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi",
    "JointGenotyping.mills_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "JointGenotyping.mills_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
    "JointGenotyping.axiomPoly_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
    "JointGenotyping.axiomPoly_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi",
    "JointGenotyping.hapmap_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz",
    "JointGenotyping.hapmap_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi",
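With this many resource entries, it may help to check that every path in an adapted inputs JSON actually exists before handing it to Cromwell. The following is only a rough sketch: the function name `missing_inputs` is made up for illustration, and it relies on the heuristic that any quoted string starting with `/` is a local file path.

```shell
# List every absolute path mentioned in an inputs JSON that does not exist on disk.
# Heuristic (an assumption, not a real JSON parser): any quoted string that
# starts with "/" is treated as a local file path.
missing_inputs() {
    # $1 = path to the inputs JSON file
    grep -o '"/[^"]*"' "$1" | tr -d '"' | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}
```

Calling `missing_inputs my_hg19_inputs.json` (a hypothetical file name) would print one line per missing file and nothing if all paths resolve.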
**Question 3**
How important are these resource files / do they constitute best practices? The WDL as written in that repository requires that all of them be provided as inputs, and Cromwell won't execute it if they aren't. However, I note that @Geraldine_VdAuwera has a year-old pull request with a version of the WDL file that does not require any of these resources. Is it safe to use (or adapt) that, or does not using the known SNP VCFs fall outside GATK best practices?
I've managed to generate my own JointGenotyping.eval_interval_list by using something like the script below.
    $VCFUTILS splitchr -l 50000000 ./GRCh37-lite.fa.fai > hg19_intervals_50M.txt
    cat hg19_intervals_50M.txt | tr ':' '\t' | tr '-' '\t' > hg19_intervals_50M.bed
    $GATK BedToIntervalList -I hg19_intervals_50M.bed -O hg19_intervals_50M.list -SD GRCh37-lite.dict
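One caveat with the `tr`-based conversion: as far as I can tell, `splitchr` prints 1-based inclusive regions (e.g. `chr1:1-50000000`), whereas BED uses a 0-based start, so the resulting intervals may begin one base late. A possible alternative is sketched below. The function name `regions_to_bed` is hypothetical, and it assumes contig names contain no `:` or `-`, which I believe holds for GRCh37-lite but not for every reference.

```shell
# Convert regions of the form contig:start-end (1-based, inclusive) read from
# stdin into BED lines (0-based start, end-exclusive) written to stdout.
# Assumption: contig names contain neither ':' nor '-'.
regions_to_bed() {
    awk -F'[:-]' 'NF == 3 { printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }'
}
```

It would slot in as `regions_to_bed < hg19_intervals_50M.txt > hg19_intervals_50M.bed`, replacing the `cat | tr | tr` line.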
However, I note that there is also a need for a JointGenotyping.unpadded_intervals_file. In the hg38 JSON, this is /home/bshifaw/broad-references/hg38/v0/hg38.even.handcurated.20k.intervals. However, there does not seem to be an equivalent, even for hg38, in the Broad bundle ftp site. What is this file, how is it generated, is it critical to the running of joint genotyping, and if not, what do I need to change in the WDL to disable it as an input?
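If the file turns out to be just a list of scatter intervals, one stopgap might be to cut the reference into fixed-size windows and write them in GATK's `contig:start-end` list format. The sketch below does that from a FASTA index. The function name `fai_to_windows` is made up, and this is an assumption about what would be acceptable, not a reconstruction of how the hand-curated Broad file was produced.

```shell
# Print fixed-size windows (contig:start-end, 1-based inclusive) for every
# contig in a FASTA index (.fai columns: name, length, ...).
# Naive tiling only; it does not reproduce any hand curation.
fai_to_windows() {
    # $1 = path to the .fai file, $2 = window size in bases
    awk -v size="$2" 'BEGIN { FS = "\t" }
        {
            for (s = 1; s <= $2; s += size) {
                e = s + size - 1
                if (e > $2) e = $2      # clip the last window to the contig length
                printf "%s:%d-%d\n", $1, s, e
            }
        }' "$1"
}
```

For example, `fai_to_windows GRCh37-lite.fa.fai 50000000 > hg19_windows.intervals` (the output name is hypothetical) would write one interval per line.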
I'll probably have some more questions as I go, but thought this would be a good start.