manba and picard_gatk_mj questions

SystemSystem Administrator admin
edited December 2018 in Ask the GATK team
This discussion was created from comments split from: Can the CNV workflow use for WES data?.

Comments

  • manbamanba Member ✭✭
    edited December 2018

    You may want to adjust some parameters, e.g. --minimum-interval-median-percentile to a lower number like 5.0 instead of the default 10.0 to retain more data points given the samples are from the same individual.

    @Sheila is there some similar needed adjustment in SNV Pon?

    @Sheila I think you can use either biological or technical replicates of RMS strain samples towards the PoN. you mean one normal sample, sequenced by 100 times, can means I create a pon with 100 samples when you say "technical replicates is also ok"?

    @Sheila when you are not that busy, I am urgent for you help in the link
    https://gatkforums.broadinstitute.org/gatk/discussion/23228/pon-filter-get-more-sites-different-bed-file#latest

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @manba,

    Sheila has left the group for greener pastures. You can express your appreciation in the blog she wrote at https://software.broadinstitute.org/gatk/blog?id=12887 and I can also let her know.

    As for your question on the Mutect2 PoN, please can you post to a fresh thread? Our new full-time frontline support specialists can then field your question. To quickly answer, remember the Mutect2 PoN is used to filter out sites of disinterest and is generally made with variant sites present in two or more normal samples. In contrast, the CNV PoN is meant to define what is normal in terms of coverage.

  • picard_gatk_mjpicard_gatk_mj Member
    edited December 2018
  • manbamanba Member ✭✭

    @shlee you not sleep, thanks a lot . is this not written by you https://software.broadinstitute.org/gatk/documentation/article?id=11136.

    I am encountered with big problems with muetc2 results. My question is mainly posted in https://gatkforums.broadinstitute.org/gatk/discussion/23228/pon-filter-get-more-sites-different-bed-file#latest. i really hope you can help me

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited December 2018

    Hi @manba,

    @shlee you not sleep, thanks a lot . is this not written by you https://software.broadinstitute.org/gatk/documentation/article?id=11136.

    I'm working remotely from a different time zone. Yes, I did develop and write Article#11136. However, my main role is not frontline support but documentation. So you will have to wait for @bhanuGandham, who is GATK's fulltime frontline support person, or someone else from the frontline support team to answer your question(s).

    Actually, I now see you've posted a series of sixteen questions plus comments in the link you shared at https://gatkforums.broadinstitute.org/gatk/discussion/23228/pon-filter-get-more-sites-different-bed-file#latest. One of our developers, @davidben, has actually personally answered your first question in the thread. That was really really nice of them. I see you've been an active GATK community member since exactly a month ago, starting November 21st. Since then you've posted 107 times!

    I'm sorry to see you are having difficulties with our tools and have so many questions. You should know the GATK Communications team is a small team and we do our best to enable scientists to use our software. You could really help us help you when posting questions by following the guidelines outlined in https://software.broadinstitute.org/gatk/blog?id=9063. Also, may I suggest you frame your questions in relation to your scientific research? For example, typically researchers who post questions give some background on the experimental question they are trying to answer, the GATK tool or workflow they are using plus any commands, and details about the error or unexpected result they see.

    Given your many questions, I think you may benefit from attending a GATK bootcamp. Our bootcamp/workshop schedule is at https://software.broadinstitute.org/gatk/events/. You can also ask your institute to schedule a GATK workshop or try to attend a GATK event at an ASHG or AGBT conference. In leiu of attending a workshop, I'd like to point you to GATK workshop materials under the Slides and workshop tutorial bundles section on the Presentations page. On this page you will also find links to YouTube videos that explain GATK tools and background context, including the entirety of workshop presentations. I think you will find these helpful.

    I mention these presentation materials because starting tomorrow the Broad Institute is closed until the new year. That means no one will be answering questions the forum until January 2. I hope you can continue in your research during this time despite this.

    Happy holidays,
    Soo Hee

  • manbamanba Member ✭✭
    edited December 2018

    @shl__ee I really thank you very much for your suggestion and all the help from other staff from Broad, all your help really light my misable life, because I started to do somatic variant from last month center, I really was Strugglled to undertand my result, but just because my low ability, your tool is very nice.

    Precisely because I know you are going to have holiday, so I post a lot to attract you to help me

    I am a woker, not a student now, so I may not have time to go to bootcamp.

    My the most important question is what do SNV pon do. because after testing a pon(made by more than 40 samples) of many tumot samples . I found unbelevable result, i do the FilterMutectCalls and FilterByOrientationBias. and compare the final vcf of with or without pon(I mean I add the pon in the mutect2 command, so I want to compare the effect with or without pon, I hope you understand my meaning, I not use pon wrongly)

    _comparing the variant sites in the two kind of vcf(with or withput pon)
    the reslut is : _

    comparing with the without pon vcf, there are_ some sites dropped_(range from 0.003 or more less to 0.8, if percentage, please multuply 100 );

    comparing with the without pon vcf,_** there are some new sites appearing**_(range from 1 to 10 sites), this is what I really can not undertand, because in your doc comparing gatk3 and gatk4 mutect2, you said remove pon sites remains unchanged

    comparing with the without pon vcf, some sites labeled with 'panel of pon',

    in both kind of vcf, many sites labeled with "base_quality;clustered_events;germline_risk;t_lod " and the filter cretia you mentioned in the doc "how to use mutect2" and your mathmatric of mutect2 in github.

    but none of my serveral hundred of samples vcfs containing the "PASS" in the FILTER column.
    I am a liittle affraid why this happend.
    more, you calculated the sites filtered by differerent labels solo or superposes in the doc "how to use mutect2", but why some labeles still kept, but some filtered automatically, you calculation is manually or automatically or both. or do you think a true variant should just have the 'PASS' in the FILTER column.

    and mathmatric of mutect2 in github did not clearly give the standard of when is value is, it should labels or filter automatically

    thanks a lot. your are really a very understanding student, when I knew you are a student in one of your answer in other's question.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @manba It would be very helpful to include your command line(s), using backticks (```) to mark code blocks. For example:

    gatk Mutect2 -R $ref -L $intervals \
       -I tumor.bam -tumor HCC1143 \
      -O calls.vcf
    

    This is extremely important because if your command line is wrong and you are simply running the tool incorrectly we can solve the issue immediately. Similarly, it's a good idea to run the latest release of GATK at all times so you don't re-discover bugs that have already been fixed.

    Also, it's usually a good idea to paste short examples of vcf output that illustrate your concern.

    Finally, when you have questions involving particular variant calls (or non-calls) it is often helpful to include a screenshot from IGV with both the original bam's reads loaded as well as the --bamout bam output from Mutect2 or HaplotypeCaller. If you peruse forum posts on Mutect2 you will see a lot of examples of good posting style. In particular you could start by searching for posts from SkyWarrior and dayzcool.

    By the way, @shlee is a Broad Institute DSP science writer with a PhD in microbiology. The comms team as a whole knows much more biology than the methods group, in fact.

  • manbamanba Member ✭✭

    @davidben Merry Christmas to you and your family and all the people in this forum ever helped me.

    I have send the screenshot in some of question, my command is not wrong, because the commad is not complicated . since you have sleeped in this time and start enjoyed your holiday, so I will give up this until you back.

  • picard_gatk_mjpicard_gatk_mj Member
    edited December 2018


    Mutect2 can use either to filter sites before reassembly, , will this be a reason that new variant sites appearing in vcf after pon(I mean compare the vcf with ot without pon after FilterMutectCalls FilterByOrientationBias )

    If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to_ filter alleles_.

    and you obsviously distinguish the word "sites alleles", why?

  • picard_gatk_mjpicard_gatk_mj Member
    edited December 2018

    about the argument
    in Mutect2
    --normal-lod
    2.2 LOD threshold for calling normal variant non-germline.

    --tumor-lod-to-emit
    -emit-lod 3.0 LOD threshold to emit variant to VCF.

    in FilterMutectCalls
    --tumor-lod
    5.3 LOD threshold for calling tumor variant

    --normal-artifact-lod
    0.0 LOD threshold for calling normal artifacts

    but you know whether the** t_lod** appear or not in the FILTER column is using 5.3 cutoff.

    puzzled, these four parameter has any relation? is there a clear calculation between them?


    and I think here maybe not right, a lot of the sites lower than 5.3 just labeled with t_lod in the column of FILTER, or at least, not all the sites filtered, alot labeled

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    and you obsviously distinguish the word "sites alleles", why?

    That's intentional. The panel of normals filters by site in that if the panel has an A->C at chr4:10045 then an A->T at chr4:10045 will also be filtered. The idea is that the site is prone to errors (especially mapping artifacts) and that no alt allele can be trusted.

    In contrast, we use the information from gnomAD by allele because the frequency of one allele has no bearing on the frequency of another.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    and I think here maybe not right, a lot of the sites lower than 5.3 just labeled with t_lod in the column of FILTER, or at least, not all the sites filtered, alot labeled

    In the VCF format a label is the same as a filter. Only variants marked PASS are considered good.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @picard_gatk_mj

    puzzled, these four parameter has any relation? is there a clear calculation between them?

    They are mainly independent. The normal lod threshold, for example, is a threshold on the likelihood that a variant is a real germline mutation. The normal artifact lod, on the other hand, is the likelihood that an allele is an artifact in the normal, which suggests that it is also an artifact in the tumor.

    The tumor emission lod is a lower threshold than the tumor lod that lets you see variants that didn't quite pass the lod threshold in the final output. You could use it, for example, to generate a ROC curve of sensitivity vs. precision.

  • manbamanba Member ✭✭
    edited December 2018

    @davidben thanks you very much for your help, even you have enjoyed holidays, you are still worried about my miserable life, about your answer,
    **_"
    and I think here maybe not right, a lot of the sites lower than 5.3 just labeled with t_lod in the column of FILTER, or at least, not all the sites filtered, a lot labeled

    In the VCF format a label is the same as a filter. Only variants marked PASS are considered good.
    "_**

    **_the filter I mean is the column name vcf "FILTER", the arrow refers to FILTER, the rectangular refers to labels.

    I know in the FILTER, there can be only two kinds of labels. "PASS" or "not PASS, just like germline_risk, base_qulity, etc"_**

    as I know, and just as you said, Only variants marked PASS are considered good.. so what I can not understand is_** "why some site dropped, for example, my site1, why some kept, for example site2, but they both labeled"**_

    _site1: _
    chr4 55602765 . G A,C . germline_risk DP=314;ECNT=1;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.807e-03,-2.746e-04;TLOD=4.92,6.57 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1/2:301,4,5:0.020,0.023:300,4,5:0,0,0:34,36:197,197,197:60,60:38,46:false:false:.,.:.,.:100.00:36.73:0.020,0.00,0.016:3.071e-03,5.067e-03,0.992

    site2:
    "
    chr2 212578379 . TAAA T,TA,TAA,TAAAA,TAAAAA,TAAAAAA . base_quality;clustered_events;germline_risk;multiallelic DP=1152;ECNT=7;POP_AF=1.000e-03,1.000e-03,1.000e-03,1.000e-03,1.000e-03,1.000e-03;P_GERMLINE=-2.170e-04,-2.169e-04,-2.169e-04,-2.169e-04,-2.169e-04,-2.439e-04;RPA=14,11,12,13,15,16,17;RU=A;STR;TLOD=9.40,109.09,440.44,194.17,15.62,6.90 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:SA_MAP_AF:SA_POST_PROB 0/1/2/3/4/5/6:506,11,82,260,139,18,7:0.011,0.076,0.246,0.141,0.037,0.028:506,11,82,260,139,18,7:0,0,0,0,0,0,0:16,30,31,28,28,31:180,180,180,180,180,180,180:60,60,60,60,60,60:53,29,31,8,1,1:false:false:0.253,0.242,0.254:0.062,5.337e-03,0.933
    "

    My question always focused on "why all labeled, but some dropped automatically, some just kept", another is why some new sites appear in vcf after pon?

    and all my sites have no "PASS" in the column of "FILTER", so it means there is no good variant, do you think it is abnormal(hundreds of samples)?

    thanks a lot

    the below is in a sample, all the sites dropped automatically, not just labeled, have you found any rules?

    chr4 55602765 . G A,C . germline_risk DP=314;ECNT=1;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-2.807e-03,-2.746e-04;TLOD=4.92,6.57 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1/2:301,4,5:0.020,0.023:300,4,5:0,0,0:34,36:197,197,197:60,60:38,46:false:false:.,.:.,.:100.00:36.73:0.020,0.00,0.016:3.071e-03,5.067e-03,0.992

    chr4 55972974 . T A . germline_risk DP=107;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=149.13 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:60,47:0.439:60,47:0,0:34:203,202:60:20:false:false:.:.:50.57:100.00:0.00,0.434,0.439:0.028,0.025,0.947

    chr5 112175770 . G A . clustered_events;germline_risk DP=273;ECNT=4;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=778.06 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:52,216:0.803:0,0:52,216:34:183,183:60:33:false:false:.:.:100.00:36.73:0.798,0.798,0.806:0.075,0.012,0.913

    chr7 55249063 . G A . germline_risk DP=405;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=145.10 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:332,67:0.176:0,0:332,67:29:221,221:60:29:false:false:.:.:100.00:36.73:0.162,0.121,0.168:5.490e-03,0.030,0.965

    chr7 128845088 . A G . clustered_events;germline_risk DP=1373;ECNT=3;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=366.43 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1183,162:0.129:1181,162:0,0:35:178,178:60:38:false:false:.:.:38.28:100.00:0.121,0.040,0.120:1.473e-03,0.326,0.673

    chr7 128846328 . G C . germline_risk DP=1091;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=2318.38 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:428,650:0.603:428,650:0,0:38:202,202:60:57:false:false:.:.:57.39:100.00:0.596,0.586,0.600:0.019,0.019,0.963

    chr8 38285913 . GTCA G . germline_risk;str_contraction DP=881;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;RPA=6,5;RU=TCA;STR;TLOD=153.96 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:SA_MAP_AF:SA_POST_PROB 0/1:806,54:0.074:806,54:0,0:36:186,186:60:52:false:false:0.061,0.00,0.063:2.134e-03,0.106,0.892

    chr9 139399445 . G A . clustered_events;germline_risk;t_lod DP=1687;ECNT=3;POP_AF=1.000e-03;P_GERMLINE=-1.216e-02;TLOD=4.25 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1661,6:0.013:1661,6:0,0:35:210,210:60:31:false:false:.:.:100.00:36.73:0.010,0.010,3.589e-03:0.427,1.024e-04,0.573

    chr10 43613843 . G T . germline_risk DP=82;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=193.42 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:24,58:0.707:0,0:24,58:33:211,212:60:20:true:false:0.00:0.00:100.00:47.72:0.00,0.707,0.707:0.030,0.025,0.945

    chr11 534242 . A G . clustered_events;germline_risk DP=2457;ECNT=3;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;TLOD=3440.18 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1309,1093:0.453:1309,1093:0,0:37:185,185:60:57:false:false:.:.:38.28:100.00:0.455,0.444,0.454:0.077,5.277e-03,0.918

  • manbamanba Member ✭✭
    edited December 2018

    INFO=<ID=POP_AF,Number=A,Type=Float,Description="population allele frequencies of alt alleles">

    INFO=<ID=P_GERMLINE,Number=A,Type=Float,Description="Posterior probability for alt allele to be germline variants">

    why P_GERMLINE seems to be_** a minus_ in vcf.
    POP_AF=1.000e-03 seems to be a constant "1.000e-03' in vcf. I do not use argument --germline-resource.
    I see someone use this argument, the value of POP_AF is 2.500e-6, and this value is the value she set.
    **but I do not use this argument, why I have this.

    INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">

    INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">

    DP=254;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-6.579e-02;RPA=8,7;RU=A;STR;TLOD=3.49.
    why RU is just one kind, but RPA has two value, 8, 7

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @manba I'm confused as to what you mean by "dropped", because you are showing sites with all the annotations that come with the vcf output of Mutect2 and FilterMutectCalls, which means they were output. Let me reiterate that FilterMutectCalls does not remove any sites from the output of Mutect2. It only applied filters to the vcf FILTER field.

    Also, please use the most recent version of our tools, and please include the command line you use when asking questions. If you are worried about reproducibility I would suggest you just re-run old results with the latest version. The official Mutect2 workflow costs only $1 for a 60x WGS tumor-normal pair, and only about 50 cents if you don't need orientation bias filtering. Results for 4.0.11 are much better than 4.0.0, which were already quite good.

    why RU is just one kind, but RPA has two value, 8, 7?

    You already answered your own question when you copied the VCF header lines:

    <ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
    <ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">

  • manbamanba Member ✭✭

    DP=153;ECNT=1;POP_AF=1.000e-03;P_GERMLINE=-2.169e-04;RPA=10,9;RU=T;STR;TLOD=9.83

    here I said : RPA means for each allele, so here RPA has two values ,meaning two allle, but RU has just one value, meaning the RU is just T here, I do not see two.

    thanks a lot

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    The alleles differ in the number of repeat units (RPA) but the unit (RU) is the same. Thus RPA=10,9;RU=T uniquely defines a reference homopolymer TTTTTTTTTT (10 copies) and an alt homopolymer TTTTTTTTT (9 copies).

Sign In or Register to comment.