GATK4 fails in BwaSpark most likely due to splitting
Running GATK4 BwaSpark fails with the following fatal error message:
[M::mem_sam_pe] Paired reads have different names: "206B4ABXX100825:7:66:2632:21260", "206B4ABXX100825:7:66:2632:31752"
The command used was:
./gatk-launch BwaSpark -I $unsorted_bam_hdfs -O $sorted_bam_hdfs -t 10 --disableSequenceDictionaryValidation true -R $ref_hdfs -K 10000000 -- --sparkRunner SPARK --sparkMaster yarn --num-executors 1 --executor-cores 10 --executor-memory 40g
$unsorted_bam_hdfs is an interleaved BAM file generated by FastqToBam, with a splitting index, and copied to HDFS.
Spark 2.0 is used.
The original FASTQ files are perfectly fine, and we have been using them for all our tests with previous versions, including 3.6. I also manually checked the name-sorted BAM file generated by FastqToBam, and the neighboring lines are correctly paired as well.
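The manual pairing check can also be scripted. The following is a minimal sketch, assuming the read names have already been extracted in file order (for example, the first column of samtools view output) and that both mates of a pair carry exactly the same name. The helper name first_mispaired is hypothetical and not part of GATK.

```python
from typing import Iterable, Optional


def first_mispaired(names: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first read whose mate is missing
    or has a different name, or None if every consecutive pair matches.

    Assumption: the input is interleaved, so reads 0 and 1 form a pair,
    reads 2 and 3 form a pair, and so on.
    """
    it = iter(names)
    index = 0
    for first in it:
        # Pull the mate that should immediately follow this read.
        second = next(it, None)
        if second is None or first != second:
            return index
        index += 2
    return None
```

A result of None for the whole file would support the claim that the input itself is correctly interleaved, which points at the splitting step rather than at the BAM.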
What I suspect is that a chunk boundary falls inside a pair, so that not just this read but every subsequent read in the chunk is mispaired and errors out. To test this, I ran the job with different -K and -bps options, and the error occurs at different locations.
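The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis, not of how BwaSpark actually splits its input: it assumes a consumer that takes reads two at a time from wherever its chunk begins. The function names and the read names r1 to r4 are made up for the illustration.

```python
def pairs_from_offset(names, offset):
    """Pair up reads two at a time, starting at `offset`, the way a
    consumer would if its chunk began at that position."""
    chunk = names[offset:]
    return [(chunk[i], chunk[i + 1]) for i in range(0, len(chunk) - 1, 2)]


def count_mismatched(pairs):
    """Count pairs whose two read names differ."""
    return sum(1 for a, b in pairs if a != b)


# Interleaved input: each read name appears twice in a row.
names = ["r1", "r1", "r2", "r2", "r3", "r3", "r4", "r4"]

aligned = pairs_from_offset(names, 2)  # chunk starts on a pair boundary
shifted = pairs_from_offset(names, 3)  # chunk starts inside a pair

print(count_mismatched(aligned), "of", len(aligned), "pairs mismatched")
print(count_mismatched(shifted), "of", len(shifted), "pairs mismatched")
```

In this model, a chunk that starts on an odd offset mispairs every following read, not just the first one. Changing the chunk size also moves the first failing read, which is consistent with the error appearing at different locations for different option values.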