Using PrintReads with fixMisencodedQuals with RNA-seq data
I am using RNA-seq data for SNP calling and am following the pipeline suggested by GATK. I got as far as Split'N'Trim, but then realized that the quality scores are in the pre-1.8 Illumina format. So I decided to use PrintReads with the -fixMisencodedQuals option to convert the quality values to the current encoding, but I am getting this error message:
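For context on what the conversion involves: Illumina 1.3 to 1.7 pipelines wrote quality characters with an ASCII offset of 64 (Phred+64), whereas Illumina 1.8+ and the SAM format use an offset of 33 (Phred+33). The sketch below only illustrates that offset shift on a single quality string. It is not GATK's implementation, the function name `phred64_to_phred33` is made up for this example, and it assumes the input really is Phred+64 rather than the older Solexa scaling.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (illustrative only)."""
    converted = []
    for ch in qual:
        score = ord(ch) - 64  # decode with the old offset
        if score < 0:
            # A negative score means the input was not Phred+64 after all
            # (it may already be Phred+33, or use Solexa scaling).
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(score + 33))  # re-encode with the current offset
    return "".join(converted)


# 'h' is Q40 in Phred+64 and becomes 'I' in Phred+33; 'B' is Q2 and becomes '#'.
print(phred64_to_phred33("hhhB"))
```

The net effect is that every quality character moves down by 31 ASCII positions, which is why a file that is already Phred+33 must not be "fixed" a second time.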
MESSAGE: Unsupported CIGAR operator N in read HWI-ST344_0078:1:1107:8636:141314#GCCAAT at scaffold53:7983. Perhaps you are trying to use RNA-Seq data? While we are currently actively working to support this data type unfortunately the GATK cannot be used with this data in its current form. You have the option of either filtering out all reads with operator N in their CIGAR string (please add --filter_reads_with_N_cigar to your command line) or assume the risk of processing those reads as they are including the pertinent unsafe flag (please add -U ALLOW_N_CIGAR_READS to your command line). Notice however that if you were to choose the latter, an unspecified subset of the analytical outputs of an unspecified subset of the tools will become unpredictable. Consequently the GATK team might well not be able to provide you with the usual support with any issue regarding any output
My question is whether filtering out all reads with operator N in their CIGAR string will affect the downstream steps, and if so, what that effect would look like. I would appreciate some more explanation of this problem.
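One way to gauge how much data the filter would discard: in the SAM format the N operator marks a skipped region of the reference, which for RNA-seq alignments usually means a read spanning a splice junction. The sketch below counts how many aligned records in SAM text carry an N in the CIGAR column (field 6). The function names and the example records are hypothetical, and it only measures how many reads would be dropped; it does not say how variant calls would change.

```python
import re

# One CIGAR element: a length followed by an operator from the SAM specification.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def has_n_operator(cigar: str) -> bool:
    """Return True if the CIGAR string contains a skipped-region (N) operator."""
    return any(op == "N" for _, op in CIGAR_ELEMENT.findall(cigar))


def fraction_with_n(sam_lines) -> float:
    """Fraction of records with a CIGAR string that contain an N operator."""
    total = spliced = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":  # no CIGAR available
            continue
        total += 1
        if has_n_operator(fields[5]):
            spliced += 1
    return spliced / total if total else 0.0


# Hypothetical records: one spliced read and one unspliced read.
example = [
    "@HD\tVN:1.0",
    "read1\t0\tscaffold53\t7983\t255\t30M1500N46M\t*\t0\t0\t*\t*",
    "read2\t0\tscaffold53\t9000\t255\t76M\t*\t0\t0\t*\t*",
]
print(fraction_with_n(example))
```

The same functions could be fed real records by iterating over the text output of a BAM-to-SAM conversion instead of the `example` list.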
This is the command that I am using:
java -jar GenomeAnalysisTK.jar -T PrintReads -fixMisencodedQuals -I input.bam -R reference.fa -o out.bam
Thank you in advance,