To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

MuTect2 and multiple read groups per sample, issue with index

MattBMattB NewcastleMember

I have an issue with BAM files containing multiple read groups for the same sample (identical SM:), essentially this is the same situation as my MuTect1 related question form 2 years ago: http://gatkforums.broadinstitute.org/gatk/discussion/3796/can-mutect-handel-multiple-read-groups - but here it appears to manifest an issue: they cause MuTect2 to throw the following error saying it can't detect index type:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.6-0-g89b7209): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Problem detecting index type
##### ERROR ------------------------------------------------------------------------------------------

Regarding my read group set-up I have matched normal and tumor samples where each sample is sequenced across two or more lanes, following BWA I've merged my sequencing run BAM files such that I'm using a single per-sample BAM with the subsequent MarkDuplicates and BQSR steps, my read group structure is:

@RG ID:RG_1 LB:Lib_10_10065_DNA1_1H SM:10_10065_DNA1_1H PL:ILLUMINA @RG ID:RG_2 LB:Lib_10_10065_DNA1_1H SM:10_10065_DNA1_1H PL:ILLUMINA

Sometimes I have more than on library prep too, so I've encoded that as follows with 4 BAMs being merged into one per sample BAM here:

@RG ID:RG_3 LB:Lib_10_10065_F2_DNA2H_1 SM:10_10065_F2_DNA2H PL:ILLUMINA @RG ID:RG_4 LB:Lib_10_10065_F2_DNA2H_1 SM:10_10065_F2_DNA2H PL:ILLUMINA @RG ID:RG_5 LB:Lib_10_10065_F2_DNA2H_2 SM:10_10065_F2_DNA2H PL:ILLUMINA @RG ID:RG_6 LB:Lib_10_10065_F2_DNA2H_2 SM:10_10065_F2_DNA2H PL:ILLUMINA

(My assumption was that this is beneficial, so in cases of the same library prep MarkDuplicates would consider all read groups for deduplication that had the same LB: and that BQSR will always model each read group ID: separately which is ideal as they originate from a different lane.)

These BAM files run fine in the HaplotypeCaller in GVCF mode, --bamOutput confirms both read groups are used. I've also checked the index with ValidateSamFile using Picard and I can't see any index related issues in it's output, the only error being ERROR:MATE_NOT_FOUND which I was expecting.

I'm also curious if the answer to my old question with MuTect1 still holds up as I note this related issue with MuTect1 and multiple BAM inputs http://gatkforums.broadinstitute.org/gatk/discussion/4641/build-a-panel-of-normal-for-mutect

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @MattB
    Hi,

    This looks like an issue with the indices. Can you try deleting and re-generating the BAM indices? Also, please post the exact command you ran.

    Thanks,
    Sheila

  • MattBMattB NewcastleMember
    edited August 2016

    Hi Sheila, the first step I did was to delete all .bai and regenerate with BuildBamIndex, so I've ruled that out, the regenerated .bai also fails in the same way. The exact MuTect2 command line I used was:

    /opt/software/java/jdk1.8.0_92/bin/java -XX:-UseLargePages -Djava.io.tmpdir=/localscratch/user_scratch_files/655724.1.all.q -Xmx10g -jar /opt/software/bsu/bin/GenomeAnalysisTK-3.6.jar \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -T MuTect2 \
    -nct 1 \
    -L ../merged_regions_covered.bed \
    --interval_padding 100 \
    -R /opt/databases/GATK_bundle/2.8/b37/human_g1k_v37_decoy.fasta \
    --cosmic /sharedlustre/users/nmb86/BNHL_exome/Cosmic_b37_v76.vcf --dbsnp /opt/databases/GATK_b
    undle/2.8/b37/dbsnp_138.b37.vcf \
    --input_file:normal /localscratch/user_scratch_files/655724.1.all.q/BNHLexm.1.dedup.realigned.recalibrated.bam --input_file:tumor /localscratch/user_scratch_files/655724.1.all.q/BNHLexm.2.dedup.realigned.recalibrated.bam \
    --out /localscratch/user_scratch_files/655724.1.all.q/10_10065_DNA1_1H_14_0003349_E.vs.10_10065_F2_DNA2H_14_0005020_Y.vcf \ --bamOutput /localscratch/user_scratch_files/655724.1.all.q/10_10065_DNA1_1H_14_0003349_E.vs.10_10065_F2_DNA2H_14_0005020_Y.bam \ 
    --log_to_file /localscratch/user_scratch_files/655724.1.all.q/10_10065_DNA1_1H_14_0003349_E.vs.10_10065_F2_DNA2H_14_0005020_Y.log
    

    I've re run all the down stream steps post SamToSortedBam including the merger operation again to double check, there are no errors reported in previous stages and the problem detecting index type issue is replicated across all 36 bam files I've merged. Has me stumped, I've been using Picard 2.4.1 I'm going to re do the whole run from FASTQ upwards using SamToSortedBam, MarkDuplicates and MergeSamFiles with Picard 2.6.0 and see if that makes any difference to the outcome.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited August 2016 Accepted Answer

    @MattB
    Hi,

    Hmm. Perhaps it is an issue with the VCF indices. Can you try deleting those and letting GATK re-generate those as well?

    Also, try running ValidateSamFile on your input BAM files.

    -Sheila

  • MattBMattB NewcastleMember

    Ah! Yes that did it, it's very true that is did not state which index it took issue with, so I deleted the Cosmic_b37_v76.vcf.idx (since dbSNP .idx was fine with other tools) and all now runs, the .idx file was generated by MuTect1 running under Java 1.7 and MuTect2 is running via GATK 3.6 under Java 1.8, I'm not sure if this could create differences, I have been using --disable_auto_index_creation_and_locking_when_reading_rods too, as our Luster FS will not support locking but had omitted this for the MuTect1 run, although it did fall back to an in memory index as locking was not supported by the FS. Regardless of how the .idx got mangled it now works, thanks for the suggestion!

    Might it be worth having an extra layer of debug in the "Problem detecting index type" message which states which index the engine is have a problem with, as it's easy to see there could be more than one index in use? Unless there is an existing debug option which could help?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @MattB
    Hi,

    Yes, indeed the index error could be more informative.

    There is an option --logging_level you can use, however, I am not sure it will give you the exact index that is malformed.

    -Sheila

  • I'm not sure if this will be helpful to others but I wanted to point out a similar multilpe-read-group issue that I encountered with mutect2:

    Even though the manual page https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php is clear about only using 1 tumor and 1 normal sample, if either of the input bams have read groups with multiple sample names (i.e., different SM tags) then reads from only one of these read groups will be used by mutect2. There is no error (or warning that I noticed) but the VCF will contain additional empty columns for the unused sample names. I've noticed that if my input tumor bam has 3 or more read groups with different sample names then mutect2 will seem to run fine but the VCF file will be empty--containing a header but no variants.

    Obviously, if one does a good job keeping track of read groups/sample names and follows the recommended naming conventions (https://software.broadinstitute.org/gatk/guide/article?id=6472) then there is no issue, but if you merge bams from the same tumor and they that have slightly different sample names then you may not be using all of your data to call variants.

    My work flow now checks each bam header for multiple sample names in the read groups and if necessary uses picard AddOrReplaceReadGroups to replace them with a single one. This extra step is time consuming so if there is a better approach then let me know!

    Issue · Github
    by Sheila

    Issue Number
    1325
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, good catch. There should be a sanity check for this -- we'll see if we can get that put in.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @jfb
    Hi,

    I just put a ticket in GATK4 for this. It will be fixed there :smile:

    -Sheila

Sign In or Register to comment.