manba and picard_gatk_mj questions

System Administrator admin
edited December 2018 in Ask the GATK team
This discussion was created from comments split from: Can the CNV workflow use for WES data?.

Comments

  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @manba,

    Sheila has left the group for greener pastures. You can express your appreciation in the blog she wrote at https://software.broadinstitute.org/gatk/blog?id=12887 and I can also let her know.

    As for your question on the Mutect2 PoN, please can you post to a fresh thread? Our new full-time frontline support specialists can then field your question. To quickly answer, remember the Mutect2 PoN is used to filter out sites of disinterest and is generally made with variant sites present in two or more normal samples. In contrast, the CNV PoN is meant to define what is normal in terms of coverage.

  • picard_gatk_mj Unconfirmed
    edited December 2018
  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭
    edited December 2018

    Hi @manba,

    @shlee, don't you sleep? Thanks a lot. Wasn't this written by you: https://software.broadinstitute.org/gatk/documentation/article?id=11136

    I'm working remotely from a different time zone. Yes, I did develop and write Article #11136. However, my main role is documentation, not frontline support. So you will have to wait for @bhanuGandham, who is GATK's full-time frontline support person, or someone else from the frontline support team to answer your question(s).

    Actually, I now see you've posted a series of sixteen questions plus comments in the thread you linked at https://gatkforums.broadinstitute.org/gatk/discussion/23228/pon-filter-get-more-sites-different-bed-file#latest. One of our developers, @davidben, has personally answered your first question in the thread. That was really nice of them. I see you've been an active GATK community member for exactly a month, since November 21st. Since then you've posted 107 times!

    I'm sorry to see you are having difficulties with our tools and have so many questions. You should know the GATK Communications team is a small team and we do our best to enable scientists to use our software. You could really help us help you when posting questions by following the guidelines outlined in https://software.broadinstitute.org/gatk/blog?id=9063. Also, may I suggest you frame your questions in relation to your scientific research? For example, typically researchers who post questions give some background on the experimental question they are trying to answer, the GATK tool or workflow they are using plus any commands, and details about the error or unexpected result they see.

    Given your many questions, I think you may benefit from attending a GATK bootcamp. Our bootcamp/workshop schedule is at https://software.broadinstitute.org/gatk/events/. You can also ask your institute to schedule a GATK workshop or try to attend a GATK event at an ASHG or AGBT conference. In lieu of attending a workshop, I'd like to point you to the GATK workshop materials under the "Slides and workshop tutorial bundles" section on the Presentations page. On this page you will also find links to YouTube videos that explain GATK tools and background context, including the entirety of workshop presentations. I think you will find these helpful.

    I mention these presentation materials because starting tomorrow the Broad Institute is closed until the new year. That means no one will be answering questions on the forum until January 2. I hope you can continue your research during this time despite this.

    Happy holidays,
    Soo Hee

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @manba It would be very helpful to include your command line(s), using backticks (```) to mark code blocks. For example:

    gatk Mutect2 -R $ref -L $intervals \
      -I tumor.bam -tumor HCC1143 \
      -O calls.vcf
    

    This is extremely important because if your command line is wrong and you are simply running the tool incorrectly we can solve the issue immediately. Similarly, it's a good idea to run the latest release of GATK at all times so you don't re-discover bugs that have already been fixed.

    Also, it's usually a good idea to paste short examples of VCF output that illustrate your concern.

    Finally, when you have questions involving particular variant calls (or non-calls) it is often helpful to include a screenshot from IGV with both the original BAM's reads loaded and the --bamout BAM output from Mutect2 or HaplotypeCaller. If you peruse forum posts on Mutect2 you will see a lot of examples of good posting style. In particular you could start by searching for posts from SkyWarrior and dayzcool.

    By the way, @shlee is a Broad Institute DSP science writer with a PhD in microbiology. The comms team as a whole knows much more biology than the methods group, in fact.

  • picard_gatk_mj Unconfirmed
    edited December 2018


    "Mutect2 can use either to filter sites before reassembly." Could this be the reason that new variant sites appear in the VCF after adding a PoN? (I mean comparing the VCFs with and without a PoN, after FilterMutectCalls and FilterByOrientationBias.)

    If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.

    And you obviously distinguish between the words "sites" and "alleles". Why?

  • picard_gatk_mj Unconfirmed
    edited December 2018

    About these arguments.

    In Mutect2:
    --normal-lod (default 2.2): LOD threshold for calling normal variant non-germline.
    --tumor-lod-to-emit, -emit-lod (default 3.0): LOD threshold to emit variant to VCF.

    In FilterMutectCalls:
    --tumor-lod (default 5.3): LOD threshold for calling tumor variant.
    --normal-artifact-lod (default 0.0): LOD threshold for calling normal artifacts.

    But whether t_lod appears in the FILTER column is decided using the 5.3 cutoff.

    I'm puzzled: are these four parameters related? Is there a clear calculation connecting them?

    And I think something may not be right here: a lot of the sites lower than 5.3 are just labeled with t_lod in the FILTER column. Or at least, not all of those sites are filtered; a lot are only labeled.

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    And you obviously distinguish between the words "sites" and "alleles". Why?

    That's intentional. The panel of normals filters by site in that if the panel has an A->C at chr4:10045 then an A->T at chr4:10045 will also be filtered. The idea is that the site is prone to errors (especially mapping artifacts) and that no alt allele can be trusted.

    In contrast, we use the information from gnomAD by allele because the frequency of one allele has no bearing on the frequency of another.
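
    To make the site vs. allele distinction concrete, here is a rough conceptual sketch in Python. It is not how Mutect2 is implemented; the tuple representation, the function names and the 0.12 allele frequency are all made up for illustration. Only the chr4:10045 A->C / A->T example comes from the explanation above.

```python
# Conceptual sketch only; this is NOT Mutect2's code.
# A variant is a hypothetical (chrom, pos, ref, alt) tuple.

def filtered_by_pon(variant, pon_sites):
    """Site-level check: the ref and alt alleles are ignored on purpose."""
    chrom, pos, _ref, _alt = variant
    return (chrom, pos) in pon_sites

def germline_frequency(variant, germline_af):
    """Allele-level lookup: only the exact ref->alt change matches."""
    return germline_af.get(variant, 0.0)

# The panel saw A->C at chr4:10045, so the whole site is flagged.
pon_sites = {("chr4", 10045)}
# Made-up population frequency for the A->C allele only.
germline_af = {("chr4", 10045, "A", "C"): 0.12}

a_to_c = ("chr4", 10045, "A", "C")
a_to_t = ("chr4", 10045, "A", "T")

print(filtered_by_pon(a_to_c, pon_sites))       # True
print(filtered_by_pon(a_to_t, pon_sites))       # True: same site, different allele
print(germline_frequency(a_to_c, germline_af))  # 0.12
print(germline_frequency(a_to_t, germline_af))  # 0.0: frequency of A->C says nothing about A->T
```

    The panel lookup keys on position alone, while the germline lookup keys on the full allele, which is the whole difference between "filter sites" and "filter alleles".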

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    And I think something may not be right here: a lot of the sites lower than 5.3 are just labeled with t_lod in the FILTER column. Or at least, not all of those sites are filtered; a lot are only labeled.

    In the VCF format, a label in the FILTER column is the filter: the record stays in the file but is marked as failing. Only variants marked PASS are considered good.
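
    If it helps, this is a minimal Python sketch of how one might separate passing records from filtered ones by reading the FILTER column (the seventh tab-separated column of a VCF data line). The three records are invented for illustration, and "other_filter" is a placeholder label, not a real GATK filter name.

```python
def failed_filters(vcf_line):
    """Return the filter labels on a VCF data line (empty list if none failed)."""
    filter_field = vcf_line.rstrip("\n").split("\t")[6]
    # "PASS" means all filters passed; "." means no filters were applied.
    if filter_field in ("PASS", "."):
        return []
    # Several failed filters are separated by semicolons.
    return filter_field.split(";")

# Invented example records: CHROM POS ID REF ALT QUAL FILTER INFO
records = [
    "chr1\t1000\t.\tA\tT\t.\tPASS\tDP=50",
    "chr1\t2000\t.\tG\tC\t.\tt_lod\tDP=12",
    "chr1\t3000\t.\tC\tT\t.\tt_lod;other_filter\tDP=30",
]

passing = [r for r in records if not failed_filters(r)]
labeled = [r for r in records if failed_filters(r)]

print(len(passing), "passing;", len(labeled), "filtered but still present in the file")
```

    A record labeled t_lod is therefore "filtered" even though it still appears in the output; downstream analysis should keep only the PASS records.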

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    I'm puzzled: are these four parameters related? Is there a clear calculation connecting them?

    They are mainly independent. The normal lod threshold, for example, is a threshold on the likelihood that a variant is a real germline mutation. The normal artifact lod, on the other hand, is the likelihood that an allele is an artifact in the normal, which suggests that it is also an artifact in the tumor.

    The tumor emission lod is a lower threshold than the tumor lod that lets you see variants that didn't quite pass the lod threshold in the final output. You could use it, for example, to generate a ROC curve of sensitivity vs. precision.
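
    As a simplified picture of how the two tumor thresholds interact, consider the following Python sketch. It assumes the values quoted earlier in this thread (3.0 and 5.3), looks at the tumor LOD in isolation, and guesses at how the exact boundary values are treated; the real tools also apply many other filters, so treat it as an illustration rather than the actual logic.

```python
# Values quoted in this thread; check the documentation for your GATK version.
EMIT_LOD = 3.0    # --tumor-lod-to-emit (Mutect2)
TUMOR_LOD = 5.3   # --tumor-lod (FilterMutectCalls)

def tumor_lod_outcome(tumor_lod):
    """Classify a candidate by its tumor LOD alone (other filters are ignored)."""
    if tumor_lod < EMIT_LOD:
        return "not emitted"
    if tumor_lod < TUMOR_LOD:
        return "emitted, labeled t_lod"
    return "emitted, passes the tumor LOD filter"

for lod in (2.0, 4.0, 6.0):
    print(lod, "->", tumor_lod_outcome(lod))
```

    The middle band, between the emission threshold and the filtering threshold, is what lets you inspect near-miss variants or sweep the cutoff when building a ROC curve.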

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @manba I'm confused as to what you mean by "dropped", because you are showing sites with all the annotations that come with the VCF output of Mutect2 and FilterMutectCalls, which means they were output. Let me reiterate that FilterMutectCalls does not remove any sites from the output of Mutect2. It only applies filters to the VCF FILTER field.

    Also, please use the most recent version of our tools, and please include the command line you use when asking questions. If you are worried about reproducibility I would suggest you just re-run old results with the latest version. The official Mutect2 workflow costs only $1 for a 60x WGS tumor-normal pair, and only about 50 cents if you don't need orientation bias filtering. Results for 4.0.11 are much better than 4.0.0, which were already quite good.

    Why does RU have just one value, but RPA has two values, 8 and 7?

    You already answered your own question when you copied the VCF header lines:

    <ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
    <ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    The alleles differ in the number of repeat units (RPA) but the unit (RU) is the same. Thus RPA=10,9;RU=T uniquely defines a reference homopolymer TTTTTTTTTT (10 copies) and an alt homopolymer TTTTTTTTT (9 copies).
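
    The same reading of the two annotations can be written as a small Python sketch. The helper names are hypothetical and the parsing assumes a simple KEY=VALUE INFO string like the one in the example above.

```python
def parse_repeat_info(info):
    """Pull RU (repeat unit) and RPA (copies per allele, reference first) from an INFO string."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    repeat_unit = fields["RU"]
    copies_per_allele = [int(n) for n in fields["RPA"].split(",")]
    return repeat_unit, copies_per_allele

def expand_repeat_alleles(repeat_unit, copies_per_allele):
    """Rebuild the repeat tract of each allele: one shared unit, one count per allele."""
    return [repeat_unit * copies for copies in copies_per_allele]

ru, rpa = parse_repeat_info("RPA=10,9;RU=T")
ref_tract, alt_tract = expand_repeat_alleles(ru, rpa)

print(ref_tract)  # TTTTTTTTTT (10 copies, reference)
print(alt_tract)  # TTTTTTTTT (9 copies, alt)
```

    RU has Number=1 because the unit is shared by all alleles, while RPA has Number=. because it carries one count per allele.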
