
SplitNCigarReads generates badCigar reads

Anne_ss Member
edited May 2016 in Ask the GATK team

For our RNA variant calling pipeline, we follow the GATK Best Practices workflow (STAR 2-pass -> mark duplicates & sort -> SplitNTrim -> indel realignment -> base recalibration -> variant calling).

After SplitNCigarReads runs successfully, IndelRealigner filters out reads for failing the BadCigarFilter. That seems strange to me, because I wouldn't expect SplitNCigarReads to produce reads that fail the BadCigarFilter.
Can someone explain to me what is happening in SplitNCigarReads to the reads that are filtered out?

For example, here are the read counts after each step for one sample:
AddOrReplaceReadGroups: 20556038
MarkDuplicates: 20556038
SplitNCigarReads: 22177873
IndelRealigner: 22177252

So for this sample, 621 reads are filtered out, which is also reported in the log file of IndelRealigner:
INFO 11:10:40,726 MicroScheduler - 621 reads were filtered out during the traversal out of approximately 22177873 total reads (0.00%)
INFO 11:10:40,726 MicroScheduler - -> 621 reads (0.00% of total) failing BadCigarFilter
INFO 11:10:40,727 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
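
(For completeness: one way to obtain such per-step read counts, assuming samtools is installed and using the output file names from the commands below, is a small loop like this:)

# Count all records in each intermediate BAM (sketch only; file names taken from the commands below)
for bam in Sample_sorted.bam Sample_sorted_dedupped.bam Sample_sorted_dedupped_splitN.bam Sample_sorted_dedupped_splitN_realigned.bam; do
    echo -n "$bam: "; samtools view -c "$bam"
done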

These are the commands (GATK v3.4) I used (up until realignment):
STAR --genomeDir Homo_sapiens.GRCh37 --runThreadN 4 --outFileNamePrefix sample_ --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --outSJfilterIntronMaxVsReadN 10000000 --chimJunctionOverhangMin 15 --chimSegmentMin 15 --twopassMode Basic --readFilesIn Sample_L001_R1_001.fastq.gz,Sample_L002_R1_001.fastq.gz,Sample_L003_R1_001.fastq.gz,Sample_L004_R1_001.fastq.gz

java -jar AddOrReplaceReadGroups.jar INPUT=Sample_Aligned.sortedByCoord.out.bam OUTPUT=Sample_sorted.bam RGID=Sample_L004_R1 RGLB=R1 RGPL=ILLUMINA RGPU=L004 RGSM=Sample

java -jar MarkDuplicates.jar INPUT=Sample_sorted.bam OUTPUT=Sample_sorted_dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=Sample_markDup_metrics.txt

java -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.GATK.illumina.fa -I Sample_sorted_dedupped.bam -o Sample_sorted_dedupped_splitN.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

java -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T RealignerTargetCreator -R Homo_sapiens.GRCh37.GATK.illumina.fa -I Sample_sorted_dedupped_splitN.bam -o Sample_target.intervals

java -jar GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner -R Homo_sapiens.GRCh37.GATK.illumina.fa -I Sample_sorted_dedupped_splitN.bam -targetIntervals Sample_target.intervals -o Sample_sorted_dedupped_splitN_realigned.bam

Issue · Github
by Sheila

Issue Number: 950
State: closed
Closed By: chandrans

Answers

  • Sheila Broad Institute · Member, Broadie, Moderator

    @Anne_ss
    Hi,

    It's possible that a few reads had odd conformations that were not handled properly by the tool, but it's such a small number of reads that it will have no impact on your analysis, so you can ignore it.

    -Sheila

  • Hi Sheila, thanks for your answer.

    Indeed, for this sample only a small number of reads is excluded, but in some cases ~3000 reads were filtered out.

    Our pipeline has a check after each tool to verify that the number of reads is still the same (to make sure no reads are lost during the different steps), so it seems strange to us to have to build in an exception for SplitNCigarReads because it somehow alters some CIGAR strings in a way that other GATK tools no longer accept.

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie
    Hi Anne_ss,

    We understand that this is a bit puzzling and not ideal, but frankly even 3000 out of 22 million is a small enough fraction that it's unlikely to be worth much time tracking down the source. I would recommend setting your pipeline checks to use a threshold proportion of reads rather than absolute numbers.
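
    For example, a proportional check along these lines (just a sketch, assuming bash and samtools are available; the 0.1% tolerance is an arbitrary placeholder you would tune) only warns when the fraction of lost reads exceeds the threshold:

    # Compare record counts before and after a step; warn only if the loss exceeds 0.1% (placeholder threshold)
    before=$(samtools view -c Sample_sorted_dedupped_splitN.bam)
    after=$(samtools view -c Sample_sorted_dedupped_splitN_realigned.bam)
    awk -v b="$before" -v a="$after" 'BEGIN { if ((b - a) / b > 0.001) exit 1 }' || echo "WARNING: read loss above threshold"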

    That being said, if you really want to figure out what's happening, you need to isolate a subset of reads that is affected and post what the records look like before and after processing by the tool. Also, make sure that the "before" file passes validation with ValidateSamFile.
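
    As a rough sketch of how that could be done (assuming samtools is available and Picard's ValidateSamFile is run the same standalone-jar way as the tools above; file names are taken from the commands in the question):

    # Check that the "before" (pre-SplitNCigarReads) file passes validation
    java -jar ValidateSamFile.jar INPUT=Sample_sorted_dedupped.bam MODE=SUMMARY
    # Collect read names that IndelRealigner dropped (present before realignment, absent afterwards)
    samtools view Sample_sorted_dedupped_splitN.bam | cut -f1 | sort -u > before.names
    samtools view Sample_sorted_dedupped_splitN_realigned.bam | cut -f1 | sort -u > after.names
    comm -23 before.names after.names > dropped.names
    # Pull the matching records before and after SplitNCigarReads so they can be compared and posted
    # (note: a name-based diff misses pairs where only one mate was dropped)
    samtools view Sample_sorted_dedupped.bam | grep -F -w -f dropped.names > dropped_before_split.sam
    samtools view Sample_sorted_dedupped_splitN.bam | grep -F -w -f dropped.names > dropped_after_split.sam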