Dear GATK Team,
While monitoring the INFO logging of my PrintReads command, I noticed that the last contig to be processed (in my case human chromosome Y) took significantly longer than each of the previous contigs. What is also strange is that the estimated remaining time (last column in the log) was already down to about one minute at the end of the previous contig (chrX). Although chrY is relatively short, it took up almost one third of the total run time of the PrintReads command. I ran the same command three times to make sure it was not a momentary slowdown of our cluster, and the slowdown always occurred in the last contig (twice in chrY and once in the chrUn contigs of the human genome). Have you observed this, too? Is it expected? Or can it be avoided, which would speed up the execution?
I used the following two commands (this was the step after BaseRecalibrator); I am attaching the logging output files.
java -Xmx4g -Djava.io.tmpdir=tmp -jar GATK/2.4.9/GenomeAnalysisTK.jar -T PrintReads -R hsapiens_coordsort_v37.fa --input_file rmdup.bam -BQSR rmdup.grp -o recal.bam -L hg19_chromosomes.bed

java -Xmx4g -Djava.io.tmpdir=tmp -jar GATK/2.4.9/GenomeAnalysisTK.jar -T PrintReads -R hsapiens_coordsort_v37.fa --input_file rmdup.bam -BQSR rmdup.grp -o recal.bam
Many thanks for your comments and suggestions,