Running joint-discovery-gatk4-local.wdl on hg19

Quoting from the 'About "Ask the team"' thread, since the "ask a question" button is working again:

@oneillkza said:
Running joint-discovery-gatk4-local.wdl on hg19

(Posting this here, since per the above posts, the "ask a question" button is disabled. Please feel free to move this to a thread.)

I'm trying to run the joint-discovery-gatk4-local.wdl on data aligned to hg19. You have provided example input json files, but only for the hg38 case. I'm in the process of generating the (many!) inputs it needs, but had a few questions:

Question 1

Many of the files needed are supplied in the GATK bundle ftp site. However, the centre I'm at has banned regular ftp (we can only use sftp), and there are quite a few files to download. Is there an easy way to get the contents of ftp://ftp.broadinstitute.org/bundle/hg19/ in a single file? My alternatives are to download the files one by one via the web browser, or to write a script using wget to scrape them.

Question 2

The input file lists a number of resource files, all of which have obvious corresponding files available in ftp://ftp.broadinstitute.org/bundle/hg38/. However, it looks like there's been a lot of consolidation of files between hg19 and hg38, and it's not entirely clear which ones to use (e.g. there are two different dbSNP files, two different hapmap files, etc). Is there a table somewhere documenting which of these are the best practices to include?

"##_COMMENT4": "RESOURCE FILES",
  "JointGenotyping.dbsnp_vcf": "/home/bshifaw/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
  "JointGenotyping.dbsnp_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
  "JointGenotyping.one_thousand_genomes_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
  "JointGenotyping.one_thousand_genomes_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi",
  "JointGenotyping.omni_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz",
  "JointGenotyping.omni_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi",
  "JointGenotyping.mills_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
  "JointGenotyping.mills_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
  "JointGenotyping.axiomPoly_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz",
  "JointGenotyping.axiomPoly_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi",
  "JointGenotyping.hapmap_resource_vcf": "/home/bshifaw/broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz",
  "JointGenotyping.hapmap_resource_vcf_index": "/home/bshifaw/broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi",

Question 3

How important are these resource files / do they constitute best practices? The WDL as written in that repository requires that all of them be provided as inputs, and Cromwell won't execute it if they aren't. However, I note that @Geraldine_VdAuwera has a year-old pull request with a version of the WDL file that does not require any of these resources. Is it safe to use (or adapt) that, or does not using the known SNP VCFs fall outside GATK best practices?

Question 4

I've managed to generate my own JointGenotyping.eval_interval_list using something like the script below.

# Split the reference into ~50 Mb chunks, one chr:start-end interval per line
$VCFUTILS splitchr -l 50000000 ./GRCh37-lite.fa.fai > hg19_intervals_50M.txt
# Turn the chr:start-end intervals into tab-separated columns
cat hg19_intervals_50M.txt | tr ':' '\t' | tr '-' '\t' > hg19_intervals_50M.bed
# Convert to a Picard-style interval list using the sequence dictionary
$GATK BedToIntervalList -I hg19_intervals_50M.bed -O hg19_intervals_50M.list -SD GRCh37-lite.dict

However, I note that there is also a need for a JointGenotyping.unpadded_intervals_file. In the hg38 JSON, this is /home/bshifaw/broad-references/hg38/v0/hg38.even.handcurated.20k.intervals. However, there does not seem to be an equivalent, even for hg38, in the Broad bundle FTP site. What is this file, how is it generated, is it critical to running joint genotyping, and if not, what do I need to change in the WDL to disable it as an input?

I'll probably have some more questions as I go, but thought this would be a good start.

Thanks!

Answers

  • oneillkza Member

    Just as an update on getting hold of the resource files for those of us who don't have access to a reasonable ftp client, here's a simple bash script that'll do that using wget:

    #!/bin/bash

    # Scrape the FTP directory listing for the file URLs, then download each one.
    files=$(wget -O - ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ \
            | grep href | grep -v "\/\." | sed 's/.*href="//' | sed 's/">.*//')

    for file in $files; do
        wget "$file"
    done
    
  • oneillkza Member

    One hint that something is off: the filenames recorded in the md5 files don't match those on the FTP site.

    e.g.

    $ cat 1000G_omni2.5.hg19.sites.vcf.gz.md5
    7c4b30bb164554939fa8a5581e93290d  /humgen/gsa-scr1/pub/bundle/2.8/hg19/1000G_omni2.5.hg19.vcf.gz
    

    i.e. the file shipped as 1000G_omni2.5.hg19.sites.vcf.gz is recorded in its md5 file, and referenced by GATK and the WDL workflow, as 1000G_omni2.5.hg19.vcf.gz.

  • oneillkza Member
    Accepted Answer

    So, further answering my own questions:

    Deep in one of the (several!) documents about the resources bundle is a comment pointing to the Google Cloud Bucket version of the hg19 bundle. It might be good to add a link to this on the pages describing the bundle.

    That version of the bundle seems to be much closer to what the JointGenotyping WDL workflow is expecting, and even includes the genomic intervals.

    It also seems like this can be downloaded as a folder using gsutil.
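    For example, something along these lines (the bucket path below is my assumption about the current public location; double-check it against the bundle documentation before relying on it):

    # -m parallelizes the transfer; cp -r copies the whole folder recursively
    gsutil -m cp -r gs://gcp-public-data--broad-references/hg19/v0/ ./hg19-bundle/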

  • oneillkza Member

    Hmm -- that scraping script failed when the server started refusing logins.

    Logging in as gsapubftp-anonymous ... 
    Login incorrect.
    

    I'm going to try the Google cloud path instead.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @oneillkza I see that you are working out a few different options to solve your own question. Thanks for tracking this issue and please chime in when you find out whether the Google cloud path option worked.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @ldalessi, can you please look into this issue with the FTP server? It seems to be affecting @oneillkza.

    @oneillkza I was able to get the resource bundle through the Google Cloud site, but I do see how the filenames are mismatched. I will post when I get some follow-up from the development team.

  • oneillkza Member
    Accepted Answer

    @AdelaideR -- thanks! I was able to get the files via gsutil cp. To be honest, I suspect it does better error checking than FTP, so I do feel more comfortable using that to download multi-GB files.

    The file names on the cloud version seem to be more closely matched, at least for the critical ones needed to run VQSR.

    I'm currently re-running the whole workflow anyway (after finding, right at the end of the entire joint genotyping, that our BAM files didn't have the @RG SM tag set, and that this is critical for GATK to keep samples separate).

    This time I'm trying to do it via Cromwell/WDL, but am having some interesting issues with Java memory (on a server that has about 800GB of RAM). This thread is providing some helpful info:

    https://gatkforums.broadinstitute.org/wdl/discussion/9617/is-there-a-way-to-put-a-limit-on-the-concurrency-of-a-task-during-a-scatter

    (In the end, I may just set this up to run on our Slurm cluster instead, since that's how we'll likely be using it in production anyway.)
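    If I understand that thread correctly, the fix amounts to capping the local backend's concurrency. A sketch of a config override, assuming the usual include-and-override pattern (the limit of 4 is an arbitrary example):

    # cromwell-local.conf -- pull in the stock config, then cap local concurrency
    include required(classpath("application"))

    backend.providers.Local.config.concurrent-job-limit = 4

    # launched with: java -Dconfig.file=cromwell-local.conf -jar cromwell.jar run ...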

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @oneillkza Thank you for the feedback.

    @ChrisL have you got any advice on how to set the Java memory to optimize the workflow?

  • oneillkza Member
    Accepted Answer

    Thanks @AdelaideR -- to be honest the issue is more that Cromwell is explicitly not designed to run on a large server with no queue, which is what I was trying to do.

    Using the Slurm cluster is a reasonable workaround for me.

  • oneillkza Member
    Accepted Answer

    Anyway, I'm busy running HaplotypeCaller on our Slurm cluster now, with some success. One minor issue is that Cromwell is using up 56GB of RAM on the head node, so in future I'll probably have to request a separate compute node to run it from.

    Anyway, the next step is to actually run the Joint Genotyping, as per my original question...

    One of the files missing is JointGenotyping.unpadded_intervals_file. For hg38, this is /home/bshifaw/broad-references/hg38/v0/hg38.even.handcurated.20k.intervals, aka gs://gatk-test-data/intervals/hg38.even.handcurated.20k.intervals. There doesn't seem to be an equivalent for hg19.

    However, the intervals just look like standard intervals, as generated by tools like vcftools, in roughly 200 kb blocks. Is there something special about the exact placement of the intervals, or can I generate my own list of 200 kb intervals using hg19 as a reference? (As per the attached file.)
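    For illustration, an equivalent way to generate such an evenly spaced list (using bedtools and Picard rather than the vcfutils approach I used earlier; whether the exact spacing matters is exactly what I'm asking):

    # Build ~200 kb windows across hg19 and convert them to a Picard interval_list.
    # hg19.genome is a two-column <chrom> <length> file (e.g. cut -f1,2 hg19.fa.fai).
    bedtools makewindows -g hg19.genome -w 200000 > hg19_200kb_windows.bed
    gatk BedToIntervalList -I hg19_200kb_windows.bed -O hg19_200kb.interval_list -SD hg19.dict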

  • oneillkza Member
    Accepted Answer

    Also the one file that seems to be missing from the Google Cloud hg19 bundle is Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.

    However, this does seem to be available at:

    ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/resources/

    It doesn't specify the genome version there, but it looks old enough that it's probably hg19.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there, I am tagging @Tiffany_at_Broad to get some follow up on these file ftp issues. Thank you @oneillkza for providing the details.

  • oneillkza Member
    Accepted Answer

    Just as another note on the joint genotyping WDL.

    There is the following line in the WDL:

    Int num_of_original_intervals = length(read_lines(unpadded_intervals_file))
    

    However, the file it is reading, hg38.even.handcurated.20k.intervals, is approximately 500KB in size (the intervals file I produced is also about 300KB). When you try to run this in Cromwell, it raises the following error:

    WorkflowManagerActor Workflow e0ae345b-fee5-4fb1-ba93-9cf4fc55749e failed (during ExecutingWorkflowState): java.lang.RuntimeException: Failed to evaluate 'JointGenotyping.num_of_original_intervals' (reason 1 of 1): Evaluating length(read_lines(unpadded_intervals_file)) failed: [Attempted 1 time(s)] - IOException: Could not read from hg19_intervals_200K.txt: File hg19_intervals_200K.txt is larger than 128000 Bytes. Maximum read limits can be adjusted in the configuration under system.input-read-limits.
    

    Per https://github.com/broadinstitute/cromwell/issues/2768 this is an issue with the default maximum file size setting in Cromwell, which needs to be overridden by passing -Dsystem.input-read-limits.lines=500000 when running the Cromwell jar.

    Since the workflow as currently provided will not execute without this tweak, it might be helpful to mention this in the documentation on https://github.com/gatk-workflows/gatk4-germline-snps-indels
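    For reference, the same override can also live in a Cromwell configuration file rather than on the java command line; a sketch, assuming the usual include-and-override pattern for launching Cromwell:

    include required(classpath("application"))

    system {
      input-read-limits {
        # allow read_lines() to read files larger than the 128000-byte default
        lines = 500000
      }
    }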

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @oneillkza Thank you again for being so specific. I have passed this information along to the documentation team. This has been a very helpful set of suggestions that will improve the experience for other users.

  • oneillkza Member

    No worries.

    Another thing I noticed when running joint genotyping is that it dynamically merges the intervals anyway. Since the merging is tuned for tens of thousands of genomes, with a smaller dataset (e.g. a trio analysis) all the intervals on each chromosome get merged, and the workflow just ends up scattering over chromosomes.

    This is a little bit inefficient -- it might be nice to have some way of controlling how aggressively the intervals are merged.

  • SChaluvadi Member, Broadie, Moderator admin
    edited December 12

    @oneillkza I believe there is an option among the interval parameters that lets you control the merging of intervals. The option --interval-merging-rule OVERLAPPING_ONLY should prevent adjacent intervals from being merged together. This could be an option to test to see whether it alleviates the inefficiency!
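    For example, a hypothetical command line showing where the flag goes (the tool, reference, and input names below are placeholders rather than anything taken from the workflow):

    gatk GenotypeGVCFs \
        -R hg19.fasta \
        -V gendb://my_genomicsdb \
        -L hg19_intervals.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O joint_genotyped.vcf.gz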
