GATK4 fails in BwaSpark most likely due to splitting

zhipan

Running GATK4 BwaSpark produces the following fatal error message:
[M::mem_sam_pe] Paired reads have different names: "206B4ABXX100825:7:66:2632:21260", "206B4ABXX100825:7:66:2632:31752"

./gatk-launch BwaSpark -I $unsorted_bam_hdfs -O $sorted_bam_hdfs -t 10 --disableSequenceDictionaryValidation true -R $ref_hdfs -K 10000000 -- --sparkRunner SPARK --sparkMaster yarn --num-executors 1 --executor-cores 10 --executor-memory 40g

$unsorted_bam_hdfs is an interleaved BAM file generated by FastqToBam, with a splitting index, and copied to HDFS.

Spark 2.0 is used.

The original FASTQ files are perfectly fine, and we have been using them for all our tests with previous versions, including 3.6. I also manually checked the name-sorted BAM file generated by FastqToBam, and the neighboring lines are correctly paired as well.

What I suspect is that a chunk boundary falls inside a pair, so that not just this pair but every subsequent pair in the chunk is mismatched and fails. To confirm this, I ran the job with different -K and -bps options, and the error occurs at different locations.
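
To illustrate the suspicion, here is a minimal Python sketch. It is not GATK, jbwa, or Hadoop-BAM code; the read names and the helpers pair_up and mismatched_pairs are made up for the example. It assumes an interleaved input where each read name appears exactly twice in a row, and a consumer that strictly pairs neighboring reads.

```python
def pair_up(names):
    """Pair neighboring read names, as a strict interleaved consumer would."""
    return [(names[i], names[i + 1]) for i in range(0, len(names) - 1, 2)]


def mismatched_pairs(names):
    """Return the neighbor pairs whose two names differ."""
    return [(a, b) for a, b in pair_up(names) if a != b]


# Hypothetical interleaved input: read0, read0, read1, read1, ...
reads = [f"read{i}" for i in range(5) for _ in range(2)]

# A chunk starting at an even offset begins on the first mate of a pair.
good_split = reads[4:]
# A chunk starting at an odd offset begins on the second mate of a pair.
bad_split = reads[3:]

print(len(mismatched_pairs(good_split)), "mismatched pairs after an even split")
print(len(mismatched_pairs(bad_split)), "mismatched pairs after an odd split")
```

In this toy case a single off-by-one boundary shifts the pairing for the rest of the chunk, so every neighbor pair after the cut has different names, which would match the behavior described above.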


  EADG

    Hi @zhipan,

    Did you try the latest build (https://github.com/broadinstitute/gatk)? Maybe you will be lucky.


  Geraldine_VdAuwera
    @zhipan Please check whether this error persists in the latest build. If it does, we'll need a test file that reproduces the error for debugging.
  zhipan

    OK... my build was from two weeks ago. I will try again. Thanks.

  zhipan

    I just tried the latest version and got the same error. BTW, I did look at the jbwa code, and it seems that SMARTPE mode is skipped entirely, so strict pairing of neighboring reads is enforced. On the other hand, the HadoopBAM input format has no special handling for interleaved BAM files to make sure that no split falls between the two reads of a pair. I think this error is almost certain to happen.
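
    As a rough idea of what pair-aware splitting could look like, here is a Python sketch. It is only an illustration under assumptions, not the HadoopBAM API: align_split_to_pair and split_interleaved are hypothetical helpers, the input is a plain list of read names, and every name is assumed to appear exactly twice in a row (no secondary or supplementary records).

```python
def align_split_to_pair(names, offset):
    """Move a proposed split offset forward by one if it falls between two mates."""
    if 0 < offset < len(names) and names[offset - 1] == names[offset]:
        return offset + 1
    return offset


def split_interleaved(names, chunk_size):
    """Cut names into chunks of roughly chunk_size without separating mates."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks, start = [], 0
    while start < len(names):
        proposed = min(start + chunk_size, len(names))
        end = align_split_to_pair(names, proposed)
        chunks.append(names[start:end])
        start = end
    return chunks


# Hypothetical interleaved input: read0, read0, read1, read1, ...
names = [f"read{i}" for i in range(5) for _ in range(2)]
for chunk in split_interleaved(names, 3):
    print(chunk)
```

    With a requested chunk size of 3, each boundary that would land inside a pair is pushed forward by one read, so every chunk holds whole pairs only.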

  Geraldine_VdAuwera
    I see, that makes sense. I checked with the devs regarding the status of this tool -- it's currently considered unfinished and its development is on hold due to other priorities, so it's completely unsupported. I'll see what I can do to document that fact more clearly.

    That being said, if you need it for your work and you feel you could fix it up (you sound like you know your way around Java code) you're very welcome to propose a pull request.
  zhipan

    Thanks a lot for the comment. I think I will stick with GATK 3.6 for the moment and revisit this when I have time again. Just curious, do you know when GATK4 will be available for public release? Is there a current timeline? Thanks.

  Geraldine_VdAuwera

    Hi @zhipan, we expect to move GATK4 to beta status in January or February. At that point we'll encourage people to beta-test the software so that we can fix any common problems, which I think will take a couple of months. So we would be ready with a full 4.0 release sometime in the Spring, probably April if I had to put money on it. But that's assuming nothing major comes up between now and then.

  zhipan

    Thanks a lot for your comment. I will see what I can do for the moment, and will surely join the beta-test when it comes.

