How to properly merge bamouts after Mutect2

dcampodcampo Los AngelesMember

Hi,
I ran Mutect2 (gatk4) using the -L command to split by chromosome, and the -bamout option.
I am now using CatVariants (gatk3) to merge all the vcf files (one per chr) into a final vcf.
And I also need to merge the bamouts. I know there are several tools to merge bams, but I've read that some of such tools have issues with the headers, etc.
Is there any particular tool you would recommend to properly merge the bams?

Thanks!

Tagged:

Best Answers

Answers

  • dcampodcampo Los AngelesMember

    Thanks Sheila!
    One last thing thing, I have 25 bam files per sample (one per chr), so I assume I can pass a list of input files, like for CatVariants and many other tools? Something like:
    I=listofbams.list

  • dcampodcampo Los AngelesMember

    Hi again,

    Turns out that MergeSamFiles does not accept a list of bams :(
    And when I tried entering manually the names of the 25 files, it quit with the error "SAM validation error: WARNING: Read name READNAME, No M or N operator between pair of D operators in CIGAR". Not related to using 25 file names though...

    Anyways, in case this helps, I ended up using bamtools (merge), which accepts a list of files, and does not complain about bam flags/operators, and it worked.

    Thanks!

    -Daniel

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dcampo
    Hi Daniel,

    Hmm. Glad you found a workaround, but can you confirm all your input BAM files validate with ValidateSamFile?

    Thanks,
    Sheila

  • dcampodcampo Los AngelesMember

    I just ran it quickly on a couple of the bam files, and got very similar results:

    Error Type Count
    ERROR:MATES_ARE_SAME_END 115868
    ERROR:MATE_NOT_FOUND 1049691
    ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 231736
    ERROR:MISMATCH_MATE_ALIGNMENT_START 756497
    ERROR:MISSING_PLATFORM_VALUE 1
    WARNING:ADJACENT_INDEL_IN_CIGAR 26
    WARNING:MISSING_TAG_NM 2944190

    Error Type Count
    ERROR:MATES_ARE_SAME_END 593830
    ERROR:MATE_NOT_FOUND 4464324
    ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 1187660
    ERROR:MISMATCH_MATE_ALIGNMENT_START 3941863
    ERROR:MISSING_PLATFORM_VALUE 1
    WARNING:ADJACENT_INDEL_IN_CIGAR 68
    WARNING:MISSING_TAG_NM 13359515

    Is this bad? Do you think I should try to address those errors?
    I am thinking that these are just bamouts from Mutect2, so they are not going to be the input of anything else, but perhaps those errors have been carried from previous steps (I followed the good practices for RNAseq variant calling)?
    Though all those previous steps have run with no issues...
    Any advice?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dcampo
    Hi,

    Ah, these are bamouts and not the BAM files you are using? Can you confirm the BAM files you are using in your analysis are valid? I don't know if bamout files should validate, but since they are not to be used in proper analysis, I guess it is okay.

    Thanks,
    Sheila

  • dcampodcampo Los AngelesMember

    Yes, those are bamouts from Mutect2.

    I have checked the 4 bam files that were used for those two Mutect2 runs (i.e. 2 tumor-normal pairs) with ValidateSamFile, and they all yield the same error:
    ERROR ValidateSamFile Value was put into PairInfoMap more than once. -1: READNAME_here

    Then I used "grep" to find that read in a couple of those bams, and I see the read twice in both cases, which I assume correspond to the mates, as this is paired-end data. One of them, for example:

    @A>;=<=?;;;;;=?;;><=?=:?=<<=<;<:;54332982..122 BD:Z:LKKJFFGHIIHAAHIIJIJIFGBHJJIFJIJLLKKIKGKJJIKILNNNRSTVVULLJJ PG:Z:MarkDuplicates RG:Z:101738_PB_FollowUp_S3-ID NH:i:1 BI:Z:MNNNIIJKMLKDDKMLJKKKHICIMKKHMMLMNMMLLINMLKOKNPPOQTSWYUOOMM HI:i:1 nM:i:0 AS:i:100 SN7001148:421:CA9APACXX:5:1216:7511:6820 89 chr1 12268706 60 843H42M * 0 0 CTGCAGATCCAGATGGCCCCGTTTTTGAGATGCTGTATGAGA >?=>?>>><>?>=><<;;;5<====<;<:<=:8;78680)<8 BD:Z:MMLLHJIIJJHJIIKKIIJJIBBCJKKKMMPQSQUUVVJJJJ PG:Z:MarkDuplicates RG:Z:101738_PB_FollowUp_S3-ID NH:i:1 BI:Z:OPOOKLLLLLKLMMMNJJMLJDEELMKNOPQQQQTWXWKNMM HI:i:1 nM:i:0 AS:i:100

    I've read that the above error may happen if there are more than one read with the same name, but this doesn't seem to be the case. I also checked the fastq files, and there's only one read with that name.
    Any idea what that error means then?

  • dcampodcampo Los AngelesMember
    edited February 22

    HI Sheila,

    an update on this: I ran ValidateSamFile on all the 91 bam files I am using in this analysis, and for 77 of them I get the error mentioned above (ERROR ValidateSamFile Value was put into PairInfoMap more than once. -1: READNAME_here).

    The other 14 output a table like this:

    Error Type Count
    ERROR:MATES_ARE_SAME_END 635
    ERROR:MATE_NOT_FOUND 1499
    ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 1270
    ERROR:MISMATCH_MATE_ALIGNMENT_START 9343
    WARNING:MISSING_TAG_NM 725550

    Obviously the numbers differ between samples, but all 14 got the same four errors and one warning.

    I've run all downstream analysis (PON, somatic calling, filtering, etc) with these 91 files, and I did not get any errors or issues. Should I be worried about biased results??

    Issue · Github
    by Sheila

    Issue Number
    2968
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dcampo
    Hi,

    For the first error, have a look at this thread.

    For the mate errors, have a look at this thread.

    Also, this article may help.

    -Sheila

  • dcampodcampo Los AngelesMember

    Thanks Sheila,

    Yes, I had read that article before. That's why I am now running ValidateSamFile in all my bams.
    So, if I understand correctly from what Geraldine says in the second thread ("Some of these mapping issues can't be fixed as such, and if so, what the tools do is they flag the problem reads so downstream tools know how to handle them (or ignore them) appropriately. I will check but I think you can move past these errors."), I can easily ignore those errors, as they will be properly handled by downstream tools, correct?

    I guess a more general question would be, if a downstream tool that uses those bam files does not crash (i.e. it finishes, and produces a non-empty output) due to those errors, can we assume that it has properly handled the problematic errors, and therefore we can trust the output?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dcampo
    Hi,

    I think it is odd that the GATK tools ran to completion with those errors, but maybe I am missing something and they really are not that bad. I will need to check with the team before telling you to move on or not. Someone will get back to you soon.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @dcampo, to be clear, bamouts are not at all intended for downstream analysis, only visualization and troubleshooting. As part of the bamout production process, we truncate reads and do all sorts of things that ValidateSamFile considers unhealthy with good reason. I strongly advise against using those files in downstream analysis. I can't predict exactly what will be the consequences because it's something that we've simply never investigated or quantified, but every fiber of my being is recoiling at the thought. No offense -- I understand why you might think it would be convenient to do this, and admittedly we never really thought to explicitly disclaim this usage -- but it just seems like a ticking time bomb to me.

  • dcampodcampo Los AngelesMember

    Thanks for the answer Geraldine, but I see there is some confusion here. I probably did not explain my self clear enough.
    Sheila asked me to validate the original bam files that I used for the analysis, and that's what I did. So the the bam files that are giving those errors I mention above are the ones generated by the pre-processing workflow (i.e. STAR 2-pass, add read groups, mark duplicates, base recalibration). And what I am trying to say is that, even with all those errors, the variant calling part of the pipeline (PON creation, somatic calling with Mutect2, filtering) ran with no issues or warnings.
    Hence my question is, can I trust the results?
    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah sorry about that @dcampo, I misunderstood what you were trying to do -- I thought you were calling variants on the bamouts.

    Unfortunately I still can't give you the complete reassurance you're looking for -- it is indeed possible that the issues were harmless and your results are fine (especially if the number of reads involved is very small compared to the overall amount of data in your files) but the absence of a visible failure does not guarantee the absence of analytical errors. Frankly, what we normally would do in a case like this, for a production sample, would be to revert the data to its unprocessed form and redo everything from scratch. If you can't do that, then you will have to accept that there is a small but unquantifiable chance that there's something wrong in your results. If you have any validation data that can help you evaluate your results I would recommend looking extra carefully at that.

    More generally, our pipeline is set up to QC the data after every step in order to identify any issues that creep in right away. I would recommend you look into doing something similar (eg run ValidateSamFile on every bam input and output) for your future work.

    Good luck!

  • dcampodcampo Los AngelesMember

    Hi again!

    So I ran now ValidateSameFile on all bam files generated in the pipeline, and the output bams of STAR, Picard AddOrReplaceReadGroups, and Picard MarkDuplicate show no errors.
    The errors appear after SplitNCigarReads, and they are the same after BaseRecalibrator/PrintReads, so I assume they are just carried over.

    I found this thread:
    https://gatkforums.broadinstitute.org/gatk/discussion/7957/errors-when-running-picard-validatesamfile-on-bam-file-got-from-splitncigarreads
    which describes exactly what I am seeing, so at least I know I am not alone in the world :)
    Now, at the end of that thread, @Sheila says "The fix will only be in GATK4, not in GATK3."
    HOWEVER, in the workflow described here:
    https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels/blob/master/rna-germline-variant-calling.wdl
    in lines 368 and 369, it says that SplitNCigarReads has not been validated in GATK4, and indeed the workflow calls GATK 3.5 for only that step (everything else is GATK4).
    Does that mean that the issue described in the thread above was actually not fixed in GATK4?
    Do you have any recommendation as to how to proceed then?

    In case this helps, I am following the workflow described here:
    https://software.broadinstitute.org/gatk/documentation/article.php?id=3891
    with STAR 2.5 (2-pass basic mode), Picard 2.11, and GATK 3.8.
    My goal is to call somatic mutations from RNAseq (tumor-normal matched samples).

    THANKS!

  • dcampodcampo Los AngelesMember

    Hi, an update on my previous comment (see right above):

    I just ran GATK4 - SplitNCigarReads on a small sample I am using for troubleshooting, and I only get a minor warning:
    WARNING:MISSING_TAG_NM 721424
    No more ERRORS :)
    (Remember, when I ran GATK3 - SplitNCigarReads on the same input file, I get all the errors I described above, which are the same as in the thread I mentioned before).

    So, it seems that SplitNCigarReads was actually debugged in GATK4.
    But if so, why don't you use it on the workflow?
    https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels/blob/master/rna-germline-variant-calling.wdl
    and instead call GATK 3.5 only for that step?
    Is there anything wrong with GATK4 - SplitNCigarReads that I am missing?

    Thanks!

    Issue · Github
    by Sheila

    Issue Number
    2992
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @dcampo
    Hi,

    I have to check with the team and get back to you.

    -Sheila

  • dcampodcampo Los AngelesMember

    Well, if this helps, I have now implemented GATK4 for all the GATK part of my pipeline, and it run smoothly, with no issues. :smile:
    I have also run Mutect2 for somatic calling, and VEP for annotation afterwards, and everything looks fine.
    Thank you very much @Sheila and @Geraldine_VdAuwera for your help!!!

Sign In or Register to comment.