To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

SplitNCigarReads results in all reads failing MalformedReadFilter?

Hi, I'm following the GATK best Practices workflow for SNP and indel calling on my RNAseq data, I'm currently at the "Split'N'Trim and reassign mapping qualities" step but come across some problems, below is my command:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I dedupped.bam -o split.bam -rf \ ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS \ --filter_mismatching_base_and_quals

Here's what I got:
INFO 00:13:19,415 ProgressMeter - Total runtime 66.03 secs, 1.10 min, 0.02 hours
INFO 00:13:19,418 MicroScheduler - 37770935 reads were filtered out during the traversal out of approximately 37770935 total reads (100.00%)
INFO 00:13:19,418 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 00:13:19,418 MicroScheduler - -> 37770935 reads (100.00% of total) failing MalformedReadFilter
INFO 00:13:19,419 MicroScheduler - -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

I also tried without "--filter_mismatching_base_and_quals" but it returned this ERROR message:

##### ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@36c1b6ff is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: BAM file has a read with mismatching number of bases and base qualities. Offender: 23570007 [51 bases] [0 quals]. You can use --defaultBaseQualities to assign a default base quality for all reads, but this can be dangerous in you don't know what you are doing.

I'm fairly new to GATK and can't really think of a reason why this would happen, so I would really appreciate if anyone can help.
JIn

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Jin_Cui
    Hi Jin,

    Have a look at this thread.

    -Sheila

  • Hi Sheila,
    Thanks for your reply, so I've added --defaultBaseQualities -1 and it gave me an error message:

    Argument --defaultBaseQualities has value -1.00, but minimum allowed value is 0.00.

    Then I tried --defaultBaseQualities 0 and it seems to be working, but no reads were filtered out:

    INFO 18:48:20,518 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 37770935 total reads (0.00%)
    INFO 18:48:20,518 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 18:48:20,518 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 18:48:20,519 MicroScheduler - -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

    And previously when I tried "--filter_mismatching_base_and_quals" but all reads failed MalformedReadFilter.

    I'm just wondering if there's something wrong with my input bam file but can't really think of a reason since I followed everything as suggested in the RNA-seq best practices (other than I had single-end reads and I used --sjdbOverhang 50 instead of 75).

  • Hi,
    I was trying GATK4 beta (gatk-4.beta.5) for SplitNCigarReads and it seems to be working, here's my command:

    ./gatk-launch SplitNCigarReads -R ref.fa -I sample.dedupped.bam -O sample.split.bam

    But when I was following the workflow using gatk4 I came across another problem with HaplotypeCaller, here's my command:

    ./gatk-launch --javaOptions "-Xmx4g" HaplotypeCaller -R ref.fa -I input.bam --dontUseSoftClippedBases -stand_call_conf 20.0 -O output.vcf

    Here's the part of what I got:

    **13:15:13.126 INFO HaplotypeCaller - 44216474 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
    29389190 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
    29389190 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
    29389190 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
    29389190 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter)
    10513014 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter)
    10513014 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
    10513014 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
    10513014 read(s) filtered by: MappingQualityReadFilter
    18876176 read(s) filtered by: NotDuplicateReadFilter
    14827284 read(s) filtered by: WellformedReadFilter **

    13:15:13.127 INFO ProgressMeter - B73V4_ctg76:3001 5.5 7114708 1304035.3
    13:15:13.127 INFO ProgressMeter - Traversal complete. Processed 7114708 total regions in 5.5 minutes.
    13:15:13.128 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
    13:15:13.128 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
    13:15:13.128 INFO HaplotypeCaller - Shutting down engine
    [October 5, 2017 1:15:13 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 5.48 minutes.
    Runtime.totalMemory()=2225078272

    The output.vcf doesn't contain any SNP/indels, would appreciate some suggestions on how to deal with this.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Jin_Cui
    Hi Jin,

    Thanks for trying GATK4, but I think the issue is with your base qualities. How did you align your reads to the reference? Did you do anything to the BAM file before running MarkDuplicates? Did MarkDuplicates throw any errors? If you look in IGV, do the base qualities look good or are they all very low?

    Thanks,
    Sheila

  • Hi Sheila,
    Thanks for you reply, I'm putting an igv snapshot of the dedupped.bam and split.bam above. I was able to get through SplitNCigarReads but the HaplotypeCaller step gives me a vcf with no SNP/InDels. I'm pasting all my commands below.

    I'm also including all the results in the attachment. I modified the 2pass alignment with --outSAMmapqUnique so that it would be compatible with GATK, I haven't seen any errors in the MarkDuplicates step.

    ./STAR --runThreadN 16 --runMode genomeGenerate --genomeDir STAR.idx --genomeFastaFiles ref.fa --sjdbGTFfile ref.gtf --sjdbOverhang 50

    ./STAR --runThreadN 16 --genomeDir ../STAR.idx --readFilesIn sample.fa

    ./STAR --runThreadN 16 --runMode genomeGenerate --genomeDir genomeDir_2pass --genomeFastaFiles ref.fa --sjdbFileChrStartEnd runDir1/SJ.out.tab --sjdbOverhang 50

    ./STAR --runThreadN 16 --genomeDir ../genomeDir_2pass --readFilesIn sample.fa --outSAMmapqUnique 60

    java -jar picard.jar AddOrReplaceReadGroups I=Aligned.out.sam O=sample.rg_added_sorted.bam SO=coordinate RGID=1 RGLB=Mutant RGPL=illumina RGPU=1 RGSM=sample1

    java -jar picard.jar MarkDuplicates I=sample.rg_added_sorted.bam O=sample.dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

    ./gatk-launch SplitNCigarReads -R ref.fa -I sample.dedupped.bam -O sample.split.bam

    ./gatk-launch --javaOptions "-Xmx120g" HaplotypeCaller -R ref.fa -I split.bam --dontUseSoftClippedBases -stand_call_conf 20.0 -O output.vcf

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Jin_Cui
    Hi Jin,

    What happens if you remove --dontUseSoftClippedBases -stand_call_conf 20.0 from your command?

    Thanks,
    Sheila

  • Hi Sheila,

    I tried removing it but I think it's the same results with 0 SNPs or InDels, Ialso tried GVCF mode by adding "-ERC GVCF" but then the Haplotype blocks are the entire genome!!

    I'm working on plant RNA-seq data, so it's a little different from human data and I know the data isn't perfect (the alignment rate is just ~82%), but I was able to call variants using bcftools, just wondering what could go wrong with gatk.

    Jin

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited October 2017

    @Jin_Cui
    Hi Jin,

    Hmm. Can you post the GVCF block you get?

    Thanks,
    Sheila

    EDIT: Can you also post an IGV screenshot of a region where you think variants should be called?

Sign In or Register to comment.