Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Can I use arbitrary genomic intervals in GemomicDBImport

Hello,
I'm facing a problem of making variant calls from ~1500 WGS samples. I have successfully run HaplotypeCaller by breaking the "wgs_calling_regions.hg38.interval_list" which comes with the GATK resource bundle into individual chromosomes and used each sub file as the value for the "-L" parameter in HaplotypeCaller. Now I'm facing merging the gvcf files using GemomicDBImport under GATKv4.0.0.0. With gvcf files from ~1500 samples, merging them will take large memory (Tried 128GB but sill not enough) machines and the run time will take many days. Our local cluster only allows 72 hours for any computing job. Does anybody have a solution for this? I've beening thinking of using the genomic intervals in the "wgs_calling_regions.hg38.interval_list" file, however, the longest interval is 141,414,040bp for chr2 (97489619 238903659 + . intersection ACGTmer). So my question is: can I further break the larger single intervals into multiple smaller ones and use each of them to run GenomicDBImport?

Thanks

Tagged:

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jianxinwang
    Hi,

    In this thread the team recommends running on 1 interval per sample. In your case, you have 1500 samples, so you would have ~1500 intervals. Looking at the intervals list for hg38 that we provide, it looks like there are over 3000 intervals. I think you should be fine running on each of those intervals. If not, feel free to split some of the larger intervals into smaller ones :smile:

    -Sheila

  • jianxinwangjianxinwang Member ✭✭

    @Sheila said:
    @jianxinwang
    Hi,

    In this thread the team recommends running on 1 interval per sample. In your case, you have 1500 samples, so you would have ~1500 intervals. Looking at the intervals list for hg38 that we provide, it looks like there are over 3000 intervals. I think you should be fine running on each of those intervals. If not, feel free to split some of the larger intervals into smaller ones :smile:

    -Sheila

    I'm a little confused. The interval file "wgs_calling_regions.hg38.interval_list" contains only 356 intervals, with the largest on chr2:97489619-238903659 which is 141MB. Not sure what does the "running on 1 interval per sample" mean. In one of my GenomicDBImport command, I specified "-L chr2:97489619-238903659". This is just one interval, but spans 141MB. If I split this interval into multiple ones, say each interval only covers 4MB, do I need to include some paddings using the -ip argument? If I do not pad, any side effects? If I do, will the overlapping region produce duplicated entries?

  • manolismanolis Member ✭✭

    @Sheila said:
    @jianxinwang
    Hi,

    In this thread the team recommends running on 1 interval per sample. In your case, you have 1500 samples, so you would have ~1500 intervals. Looking at the intervals list for hg38 that we provide, it looks like there are over 3000 intervals. I think you should be fine running on each of those intervals. If not, feel free to split some of the larger intervals into smaller ones :smile:

    -Sheila

    Hi Sheila,

    in my case I have 40 WES and 234.000 smalls intervals (they are not consecutive intervals (I can't merge them), and this list is the same as my targeted interval list during library preparation)... is it a problem? otherwise, as you suggested I have to split the total of chromosomes in 40 regions...!?

    Thanks

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @jianxinwang,

    For importing WGS GVCFs, we showcase a joint discover workflow that splits the genome into 20,361 intervals. The workflow is at https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl and the ~20K intervals list is listed in https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.hg38.wgs.inputs.json at:

    "JointGenotyping.unpadded_intervals_file": "gs://gatk-test-data/intervals/hg38.even.handcurated.20k.intervals",
    

    All of the gatk-workflow repository example files are publically accessible. You can use gsutil to download or view the gs:// files. E.g. gsutil cp gs://gatk-test-data/intervals/hg38.even.handcurated.20k.intervals ~/Downloads/ copies the intervals to your Downloads folder. Here are two example commands to inspect the file:

    I think what @Sheila is saying is for you to increase the number of intervals for larger cohorts such that your number of intervals is equivalent to one per sample. For example, the genome is divided into ten parts for ten samples or a hundred parts for a hundred samples. For your 1500 samples, 1500 intervals should be appropriate. You can start with the 20K intervals list and merge every ten consecutive intervals for a 2K intervals list appropriate for your cohort size.

    You can see that the WDL script calls the DynamicallyCombineIntervals task to do so:

    Int num_of_original_intervals = length(read_lines(unpadded_intervals_file))
      Int num_gvcfs = length(read_lines(sample_name_map))
    
      # Make a 2.5:1 interval number to samples in callset ratio interval list
      Int possible_merge_count = floor(num_of_original_intervals / num_gvcfs / 2.5)
      Int merge_count = if possible_merge_count > 1 then possible_merge_count else 1
    
      call DynamicallyCombineIntervals {
        input:
          intervals = unpadded_intervals_file,
          merge_count = merge_count,
          docker_image = python_docker
      }
    
      Array[String] unpadded_intervals = read_lines(DynamicallyCombineIntervals.output_intervals)
    
      scatter (idx in range(length(unpadded_intervals))) {
        # the batch_size value was carefully chosen here as it
        # is the optimal value for the amount of memory allocated
        # within the task; please do not change it without consulting
        # the Hellbender (GATK engine) team!
        call ImportGVCFs {
          input:
            sample_name_map = sample_name_map,
            interval = unpadded_intervals[idx],
            workspace_dir_name = "genomicsdb",
            disk_size = medium_disk,
            docker_image = gatk_docker,
            gatk_path = gatk_path,
            batch_size = 50
        }
    
    

    And the corresponding task details are in the following code snippet:
    https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/6d422f644f80237e6102d54756a650f2d2e4e2de/joint-discovery-gatk4-fc.wdl#L880-L944

    I hope this is helpful.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @manolis,

    For you exome data, I don't think it is necessary to split the data at all. Remember GenomicDBImport takes a contig at a time and so your intervals would consist of the contig list for your data.

  • sam0persam0per Member

    Hi,

    I am assuming I could merge consecutive intervals after the variant call with Haplotypecaller and before GenomicsDBImport. If this is true, then I have a couple of embarrassing questions :)

    I would like to use DynamicallyCombineIntervals but my beginner knowledge about WDL scripts requires a little bit more of information that I hope the GATK staff could provide. Here are the questions:

    1) Are the above task for DynamicallyCombineIntervals and the snippet posted by @shlee sufficient to run the script? Could you advise me how to use just the DynamicallyCombineIntervals, please?

    2) What is the command to run the WGS workflow? I could not find any example on the github page

    Issue · Github
    by Sheila

    Issue Number
    3061
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sam0per
    Hi,

    Sorry for the delay. I need to confirm a couple of things and get back to you.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited May 2018

    @sam0per
    Hi,

    Are you asking how to run this in WDL? It looks like your task is in Python. You should use Cromwell and WDL OR Python. For help with running WDL, please post to the WDL forum. For Python, we cannot help you.

    -Sheila

Sign In or Register to comment.