
Where is "known_indels_sites_VCFs" defined?

freek Member
edited May 2018 in Ask the GATK team

Dear GATK team,

I have been translating your WDL files into shell scripts so that they map better onto the scheduler on our Linux cluster. (Shell scripts are not already available anywhere, are they?)

At some point, PairedEndSingleSampleWf.wdl references known_indels_sites_VCFs. I thought this array would be defined in JointGenotypingWf.hg38.inputs.json; however, the name known_indels_sites_VCFs is not mentioned there, and the files listed under "##_COMMENT4": "KNOWN SITES RESOURCES" include SNP sites as well as known indel sites. So my question is: is known_indels_sites_VCFs this entire list or some subset of it? If it is a subset, where is it defined?

Highest regards,

Freek

Answers

  • freek Member
    edited May 2018

    On a second note, I also cannot find wgs_coverage_interval_list. It is needed by (among others) CollectRawWgsMetrics, but JointGenotypingWf.hg38.inputs.json only defines:

      "##_COMMENT2": "INTERVALS",
      "JointGenotyping.call_interval_list": "gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list",
      "JointGenotyping.eval_interval_list": "gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
      "JointGenotyping.unpadded_intervals_file": "gs://gatk-test-data/intervals/hg38.even.handcurated.20k.intervals",

    Am I missing something?
    (By the way, is there a reason why JointGenotyping.unpadded_intervals_file is not stored in the same folder as the rest of the files?)
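A quick way to check which variables an inputs JSON really defines is to load it and strip the workflow prefix from each key. The sketch below is only an illustration: the helper name `workflow_inputs` is hypothetical, it assumes keys have the form "Workflow.variable" (or "Workflow.task.variable") as in the excerpt above, and it treats every key starting with "##" as a comment label.

```python
import json


def workflow_inputs(inputs_json_text):
    """Map bare variable names to values, skipping '##' comment keys."""
    raw = json.loads(inputs_json_text)
    result = {}
    for key, value in raw.items():
        if key.startswith("##"):
            continue  # "##_COMMENT..." entries are section labels, not inputs
        # Keep only the last dotted component, e.g. "call_interval_list".
        name = key.rsplit(".", 1)[-1]
        result[name] = value
    return result


# The excerpt quoted above, closed into a complete JSON object.
example = """{
  "##_COMMENT2": "INTERVALS",
  "JointGenotyping.call_interval_list": "gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list",
  "JointGenotyping.eval_interval_list": "gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list",
  "JointGenotyping.unpadded_intervals_file": "gs://gatk-test-data/intervals/hg38.even.handcurated.20k.intervals"
}"""

inputs = workflow_inputs(example)
print(sorted(inputs))
print("wgs_coverage_interval_list" in inputs)
```

Note that two keys sharing the same last component would overwrite each other in this sketch, so it is meant for a quick look rather than as a faithful parser.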

  • freek Member
    edited May 2018

    One more remark: the file gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list does not seem to be present in the bucket here: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0, or at least a "v1" was added to the file name. The file wgs_evaluation_regions.hg38.interval_list is not present in the bucket at all.

    Am I misunderstanding the relation between the WDL, the JSON and the Google Cloud bucket?
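One thing worth checking here is which bucket a gs:// URI actually names, since the first path component after gs:// is the bucket itself. The sketch below splits a URI into bucket and object path and builds the matching browser address. The function names are hypothetical, and the URL pattern is simply inferred from the console link quoted in this thread, not from any official specification.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, object_path = uri[len(prefix):].partition("/")
    return bucket, object_path


def console_url(uri):
    """Browser URL of the folder that holds the object (pattern assumed)."""
    bucket, object_path = split_gs_uri(uri)
    folder = object_path.rsplit("/", 1)[0] if "/" in object_path else ""
    base = "https://console.cloud.google.com/storage/browser"
    return f"{base}/{bucket}/{folder}".rstrip("/")


uri = "gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list"
print(split_gs_uri(uri))
print(console_url(uri))
```

For the URI from the JSON, the bucket comes out as broad-references, which is not the genomics-public-data bucket linked in this post.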

  • freek Member
    edited May 2018 Accepted Answer

    I have found the definition of known_indels_sites_VCFs in PairedEndSingleSampleWf.hg38.inputs.json. I fear I was looking in the wrong json before (JointGenotypingWf). My apologies. (This also goes for my other remarks.)

    I still can't find the contamination_sites files in the bucket, but I will ask about that elsewhere.
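To avoid looking in the wrong JSON, one option is to search every inputs file in a folder for the variable name. This is a minimal sketch under assumptions: `find_input` is a hypothetical helper, the inputs files are assumed to be local copies ending in .json, and keys are assumed to end in ".<variable>".

```python
import json
from pathlib import Path


def find_input(variable, directory="."):
    """Return (file name, full key, value) for each key ending in `variable`."""
    hits = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or malformed file: skip it
        if not isinstance(data, dict):
            continue
        for key, value in data.items():
            # Compare only the last dotted component of the key.
            if key.split(".")[-1] == variable:
                hits.append((path.name, key, value))
    return hits


# Example call (the directory name is a placeholder):
# for name, key, value in find_input("known_indels_sites_VCFs", "inputs"):
#     print(name, key, value)
```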

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @freek
    Hi Freek,

    Sorry for the delay. Glad you figured some stuff out on your own. I am checking with the team on some things, but am I correct in assuming you have found all files you need except for a contamination file? Can you specify exactly which contamination file you are looking for? It would be best to keep all questions in one thread, I think.

    Thanks,
    Sheila

  • freek Member
    edited May 2018

    Hi @Sheila
    Thanks for your response. I'm currently checking again whether I have everything.

    I have been using the files from here: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1 for a pipeline based on broad-prod-wgs-germline-snps-indels (I am trying to reimplement it exactly in bash).

    My first question would be: is this a correct source, and is my expectation correct that I should find all auxiliary files (like references) there? This source is not versioned, so how do I know whether it has changed? How do I cite this source? I used to build on GRCh38, which has subversions like GRCh38.87, but the bucket files are simply named "Homo_sapiens_assembly38.fasta".

    I was indeed missing contamination_sites_ud, contamination_sites_bed and contamination_sites_mu. But I also see now that CheckContamination is not a GATK or Picard function; it is a step in which VerifyBamID is run, followed by some Python that creates a file that HaplotypeCaller uses. I also wonder how to implement this. I assume VerifyBamID is a third party tool I should install next to gatk4?
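Since the bucket carries no version label, one workaround is to record checksums of the downloaded copies and compare them after a later download. The sketch below is only one possible approach, not something the GATK team prescribes: the function names are hypothetical, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large references need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(directory):
    """Map each file's path relative to `directory` to its SHA-256 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed(old, new):
    """Names whose digest differs or that appear in only one manifest."""
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))
```

Storing the manifest next to the analysis (for example as JSON) also gives something concrete to refer to when describing which copy of the files was used.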
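As a rough idea of what the Python part of such a step might look like, the sketch below reads a tab-separated result table and pulls out one contamination estimate. It rests on assumptions not confirmed in this thread: that VerifyBamID writes a table with a header row and a FREEMIX column, and that exactly one data row is expected. The function name is hypothetical, and any scaling or checks the real workflow applies are not reproduced here.

```python
import csv
import io


def freemix_from_selfsm(text):
    """Return the FREEMIX value from a one-row, tab-separated table (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    if len(rows) != 1:
        raise ValueError(f"expected exactly one data row, found {len(rows)}")
    return float(rows[0]["FREEMIX"])


# Made-up table for illustration only; real output has more columns.
example = "#SEQ_ID\tFREEMIX\nsample1\t0.0123\n"
print(freemix_from_selfsm(example))
```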

    I will let you know if I run into any missing files soon.

    Highest regards, and thank you, I really appreciate this forum and I'm learning a lot!

    Freek.

  • Sheila Broad Institute Member, Broadie, Moderator admin
    Accepted Answer

    @freek
    Hi Freek,

    Sorry for the confusion. These hg38 files are provided "as-is", but I think you saw in another thread that https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 is where all the files are stored for now. The team is working on making this more obvious and less confusing for users.

    "I assume VerifyBamID is a third party tool I should install next to gatk4?"

    Yes.

    -Sheila
