Questions about PublicPairedSingleSampleWf_170412.wdl

Hi there,

I was following the tutorial "(How to) Map and clean up short read sequence data efficiently" when I discovered the link to the complete protocol in WDL. After running several steps, I have a few questions you may be able to help me with:

1) For the MergeBamAlignment step:

a) Should I be using the uBAM file produced in the FastqToSam step or the one produced in the MarkIlluminaAdapters step?

b) What is the purpose of the option ATTRIBUTES_TO_RETAIN=X0, and what would change if I used ATTRIBUTES_TO_RETAIN=XS as in the tutorial?

c) Before discovering the WDL script, I had already run all my samples up to the SortAndFixSampleBam step (the one after MarkDuplicates), using the uBAM produced by the FastqToSam step and the option ATTRIBUTES_TO_RETAIN=XS for MergeBamAlignment. Would this have any unwanted or unexpected effect on the remaining portion of the WDL pipeline? Can I continue as is, or should I fix the files? If so, how?

2) I am very interested in the identity validation of the samples for quality control, as you mentioned here.

a) Could you tell me which platform/kit you use for the genotyping array for fingerprinting?

b) When looking at the CheckContamination task for cross-sample contamination, I saw that it points to the file WholeGenomeShotgunContam.vcf, which I was unable to find in the GATK bundle; nor could I work out how it is produced. Could you give me a hint on that?

3) When checking the bwa mem options used, I noticed the option "bwa mem -K 100000000", for which I was unable to find any documentation (online or from the command itself). Could you tell me what it is for?

Thank you very much in advance for any help you could provide me.

Best regards,
Santiago

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @santiagorevale
    Hi Santiago,

    I moved your question to ask the WDL team where @KateN will help you.

    -Sheila

  • santiagorevale, Argentina (Member)

    Hi there!

    Were you able to check any of the previous questions?

    Regarding question 2b), I was able to download the file, but I would still like to know how you created it.

    Thanks in advance.

    Best regards,
    Santiago

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @santiagorevale
    Hi Santiago,

    I am sorry I moved this to the WDL forum; your questions are indeed GATK-related! I will consult the team and get back to you ASAP.

    -Sheila

  • shlee, Cambridge (Member, Broadie, Moderator admin)
    edited June 2017

    Hi @santiagorevale,

    1a. The presented workflow uses the uBAM produced by FastqToSam.
    1b. ATTRIBUTES_TO_RETAIN specifies the SAM alignment record tags you would like to carry over from the alignment. Which tags to list depends on what you need for your analyses. The tutorial gives the XS tag as an example to highlight the ATTRIBUTES_TO_RETAIN option.
    1c. See answer to 1b.
    2. Your question is my first exposure to the identity validation steps within the WDL, so thanks for bringing them to my attention. I'll answer your related questions as best I can.
    2a. As far as I know, the Broad Genomics Platform does not currently use arrays for fingerprinting. In the past, SNP arrays were used. I believe fingerprinting is now done directly on sequencing data, at the read group level, to check for sample swaps, e.g. between tumor and matched normals. I could be wrong here.
    2b. I am not familiar with WholeGenomeShotgunContam.vcf. Sorry.
    3. I've asked the same question myself: what is -K for in bwa mem commands? From my notes, I see the following explanation that I got from Heng Li:

    By default, bwa-mem loads a batch of reads into RAM to process. The number of loaded bases is proportional to the number of threads. If you use a different number of threads, the mapping results may be slightly different. This hurts reproducibility. -K disables the behavior by loading a fixed number of bases into RAM. It should not affect ALT mapping.
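To make the batching point concrete, here is a toy model in Python. It is only a sketch of the idea in the explanation above, not bwa's actual code: the function name `batch_boundaries` and the per-thread batch size are assumptions made up for illustration. It shows that when the batch size scales with the thread count, the reads are split into batches at different places for different thread counts, whereas a fixed batch size (what -K requests) gives the same split every time.

```python
from typing import List, Optional

# Assumed per-thread batch size in bases. This is a placeholder for the
# illustration, not a value taken from this thread.
BASES_PER_THREAD = 10_000_000


def batch_boundaries(read_lengths: List[int], threads: int,
                     fixed_bases: Optional[int] = None,
                     bases_per_thread: int = BASES_PER_THREAD) -> List[int]:
    """Return the index of the first read in each batch.

    Without fixed_bases, the batch size scales with the thread count
    (the default behaviour described above). With fixed_bases, the batch
    size is constant (the behaviour -K asks for).
    """
    limit = fixed_bases if fixed_bases is not None else bases_per_thread * threads
    starts, loaded = [0], 0
    for i, length in enumerate(read_lengths):
        loaded += length
        # Close the batch once enough bases are loaded, unless no reads remain.
        if loaded >= limit and i + 1 < len(read_lengths):
            starts.append(i + 1)
            loaded = 0
    return starts


# Hypothetical input: 1000 reads of 150 bases, with a tiny per-thread batch
# size so the effect is visible.
reads = [150] * 1000
print(batch_boundaries(reads, threads=2, bases_per_thread=15_000))
print(batch_boundaries(reads, threads=4, bases_per_thread=15_000))
print(batch_boundaries(reads, threads=2, fixed_bases=45_000))
print(batch_boundaries(reads, threads=4, fixed_bases=45_000))
```

The first two calls print different batch boundaries because the thread count differs; the last two print identical boundaries because the batch size is fixed.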

    I hope this is helpful.
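For answers 1a and 1b, a small Python helper can show where ATTRIBUTES_TO_RETAIN fits in a MergeBamAlignment call. This is a sketch under assumptions: the helper name `merge_bam_alignment_args`, the file names, and the `picard.jar` location are all hypothetical, and the argument list is shortened compared with the one the WDL uses. It only builds the command; it does not run Picard.

```python
from typing import List, Sequence


def merge_bam_alignment_args(aligned_bam: str, unmapped_bam: str,
                             reference: str, output: str,
                             retain_tags: Sequence[str] = ()) -> List[str]:
    """Build a Picard MergeBamAlignment argument list (KEY=VALUE syntax)."""
    args = [
        "java", "-jar", "picard.jar", "MergeBamAlignment",
        f"ALIGNED_BAM={aligned_bam}",
        f"UNMAPPED_BAM={unmapped_bam}",
        f"REFERENCE_SEQUENCE={reference}",
        f"OUTPUT={output}",
    ]
    # One ATTRIBUTES_TO_RETAIN entry per aligner tag to carry over.
    args += [f"ATTRIBUTES_TO_RETAIN={tag}" for tag in retain_tags]
    return args


# Hypothetical file names. The unmapped BAM is the FastqToSam output (answer 1a).
cmd = merge_bam_alignment_args(
    aligned_bam="sample.aligned.bam",
    unmapped_bam="sample.fastqtosam.unmapped.bam",
    reference="reference.fasta",
    output="sample.merged.bam",
    retain_tags=["X0"],
)
print(" ".join(cmd))
```

Swapping `retain_tags=["X0"]` for `["XS"]` changes only which aligner tag is carried over into the merged BAM; the rest of the command is the same.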

  • shlee, Cambridge (Member, Broadie, Moderator admin)

    @santiagorevale,
    I've consulted with our developers on 2a and 2b. I can only follow up on 2b: the contamination file should be available from the VerifyBamId website.

  • krdav (Member, Broadie)

    @shlee
    I can't find those input VCF files on the VerifyBamId website. Can you link to them? If I am not mistaken, some people are using HapMap for this purpose.

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @krdav
    Hi,

    I will ask Soo Hee to get back to you.

    -Sheila

  • shlee, Cambridge (Member, Broadie, Moderator admin)

    Hi @krdav,

    You'll have to ask the VerifyBamId folks for their files. This is not one of our tools.

    On a side note, Picard has some fingerprinting tools available. You can read about them at https://gatkforums.broadinstitute.org/gatk/discussion/9526/picard-haplotype-map-file-format, which also describes the Haplotype Map file format that the tools use.
