Holiday Notice:
The Frontline Support team will be offline February 18 for President's Day but will be back February 19th. Thank you for your patience as we get to all of your questions!

Mutect2 parallel problem

Dear GATK team.

I am using Mutect2 to call somatic mutation from tumor/normal paired sample. However after jobs running for 8 days, our server has been rebooted for some reason. Most of the jobs are done by more than 70%. For example, some jobs called variants at Chr14, some at Chr19, and it seems the variants calling are by chromosomes. May I ask is there a way to continue the unfinished part?

I used parallel option (-nct 4) and nonparallel option for the same jobs, but it turns out that the system used more time on communication among different threads, rather than actually speeding up. Parallel jobs are actually about 4 times slower than non-parallel jobs. Considering garbage collection problem, I added java -Xmx24G -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 ... in the command.

Could I submit Mutect2 for each chromosome, rather than whole genome? By submitting jobs for each chromosome can actually make the variants calling parallel.

Thanks,
Qingrun

Best Answer

Answers

  • EADGEADG KielMember ✭✭✭

    Hi @Qingrun_Zhang,

    I dont know if you can continue an unfinished job from mutect,2 maybe @Geraldine_VdAuwera can help you.

    It is known that mutect2/ and the HaplotypeCaller sometimes behave strange if the -nct argument is >1.

    You can limit mutect2 to an specific interval by using -L tag. Have a look at this FAQ about intervals: http://gatkforums.broadinstitute.org/gatk/discussion/1319/collected-faqs-about-interval-lists

    But I would recommend you to use wdl with scatter/gather to do this in parrallel over different regions. You can adopt the scatter/gather for the HaplotypeCaller from the PublicPairedSingleSampleWf-Pipeline (Best Practice PipeLine) which lives here: https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines

    Hope that helps...

    Greetings EADG

  • Qingrun_ZhangQingrun_Zhang CalgaryMember

    Thanks, EADG.

  • EADGEADG KielMember ✭✭✭

    Hi @Qingrun_Zhang,

    you are wellcome :) Can you mark the question as answered (only if you have no further questions:))?

    Greetings EADG

  • steve1steve1 Member
    edited October 2017

    Is there an actual example of how to split the variant calling by chromosome? The page here:
    https://gatkforums.broadinstitute.org/gatk/discussion/1319/collected-faqs-about-interval-lists
    does not actually show you how to do it, it just talks a lot about all the possibilities... without giving examples. In particular, its not entirely clear to me how the -L argument is supposed to be formatted as a command line argument, and how exactly we're supposed to figure out the exact [chrom] [start] [stop] coordinates to use for a particular chromosome.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @steve1
    Hi,

    I hope this article helps.

    For start and stop coordinates, I am assuming you are working with exomes? If so, you can get an interval list from your sequencing provider that will provide the positions of interest/that are covered.

    -Sheila

  • xiang_zuoxiang_zuo shanghaiMember

    @Sheila said:
    @steve1
    Hi,

    I hope this article helps.

    For start and stop coordinates, I am assuming you are working with exomes? If so, you can get an interval list from your sequencing provider that will provide the positions of interest/that are covered.

    -Sheila

    @Sheila
    So how can I merge the calling result of each chromosome? Is there a corresponding tool in GATK, or just use cat to merge? Thanks!

  • xiang_zuoxiang_zuo shanghaiMember

    @xiang_zuo said:

    @Sheila said:
    @steve1
    Hi,

    I hope this article helps.

    For start and stop coordinates, I am assuming you are working with exomes? If so, you can get an interval list from your sequencing provider that will provide the positions of interest/that are covered.

    -Sheila

    @Sheila
    So how can I merge the calling result of each chromosome? Is there a corresponding tool in GATK, or just use cat to merge? Thanks!

    @Sheila
    I used version 3.7 and got to know CatVariants is useful for merging. thanks!

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @xiang_zuo,

    Please check out GATK4's GatherVcfsCloud. Despite the name, the tool also works locally.

  • steve1steve1 Member

    I am wondering if splitting the targets by chromosome can affect the outcome. We have been running some tests, and we are seeing slight differences in the results when we run with & without splitting. For example:

    Chr Start   End Ref Alt POS REF ALT QUAL    NORMAL.AD   NORMAL.AF   TUMOR.AD    FREQ    TUMOR.AD.TOTAL  NORMAL.AD.TOTAL Type
    chr14   36989419    36989419    C   A   36989419    C   A   1330.23 100,0   0   565,508 0.479   1073    100 5Chunk
    chr14   36989419    36989419    C   A   36989419    C   A   1334.17 100,0   0   566,508 0.479   1074    100 noChunk
    

    In this case we actually had 3 different methods; splitting targets into 5 equal chunks ("5chunk"), splitting by chromosome ("chromChunk"), and no splitting ("noChunk"). You can see that there are slight differences in the allelelic depths reported, along with the QUAL scores. And the variant did not even show up in the chromChunk output. Amongst other variants, we saw all combinations of variants showing up in one, two, or three of the categories, and often with small differences in the reported values. Any idea what is going on and if this is related to the splitting? Is this some kind of background-noise that is affecting the results independent of the splitting?

  • steve1steve1 Member

    @xiang_zuo said:

    @xiang_zuo said:

    @Sheila said:
    @steve1
    Hi,

    I hope this article helps.

    For start and stop coordinates, I am assuming you are working with exomes? If so, you can get an interval list from your sequencing provider that will provide the positions of interest/that are covered.

    -Sheila

    @Sheila
    So how can I merge the calling result of each chromosome? Is there a corresponding tool in GATK, or just use cat to merge? Thanks!

    @Sheila
    I used version 3.7 and got to know CatVariants is useful for merging. thanks!

    an alternative that I ended up using was to convert the vcf to tsv with VariantsToTable and then simply concatenate all the .tsv tables, since I needed the final product in .tsv format anyway.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    @steve1

    You are right, this is due to background-noise that is affecting the results and independent of the splitting. This is something we observe with edge cases.

  • steve1steve1 Member
    edited February 14

    @bhanuGandham thanks for that explanation. Just to be clear, does splitting the target region have an effect at all on the variant calling results? If I have, for example, 10,000 target regions, would I get the same results if I ran all 10,000 regions individually in parallel independent processes vs. running all of them together vs. some other combination of splitting the target regions?

Sign In or Register to comment.