How can I run the entire pipeline (including the Spark tools) from Java?
I am trying to write a Java pipeline that follows the GATK Best Practices and, in particular, processes more than one input sample.
As a first step, I am trying to use FastqToSam (not mandatory for the Best Practices, but required when starting from FASTQ files), BwaAndMarkDuplicatesPipelineSpark, and BQSRPipelineSpark.
For example, with FastqToSam I am using this simple approach, which lets me "sparkify" the command across multiple samples and even obtain some speedup:
JavaRDD<String> rdd_fastq_r1_r2 = sc.parallelize(fastq_r1_r2);
createBashScript(gatkCommand);
JavaRDD<String> bashExec = rdd_fastq_r1_r2.pipe("/path/script.sh");
where fastq_r1_r2 is a list of Strings holding the paths of the sample files to use.
In short, inside Spark's pipe method I execute a bash command for each pair of paired-end read files (specifically, the command explained here):
java -Xmx8G -jar picard.jar FastqToSam [...]
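For clarity, this is roughly what my script.sh does per sample pair, sketched in plain Java. The file names, the picard.jar location, and the SM read-group value are placeholders, and the actual ProcessBuilder launch is commented out; the F1/F2/O/SM short argument names are Picard FastqToSam's:

```java
import java.util.ArrayList;
import java.util.List;

public class FastqToSamCommand {
    // Build the per-pair FastqToSam command line; paths are illustrative placeholders.
    static List<String> build(String fastq1, String fastq2, String sampleName) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx8G");
        cmd.add("-jar");
        cmd.add("picard.jar");
        cmd.add("FastqToSam");
        cmd.add("F1=" + fastq1);                       // first reads file
        cmd.add("F2=" + fastq2);                       // second reads file
        cmd.add("O=" + sampleName + ".unmapped.bam");  // output uBAM
        cmd.add("SM=" + sampleName);                   // sample name for the read group
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();  // actual execution
        return cmd;
    }
}
```

Each element of the piped RDD is one such pair of paths, so this runs once per sample on each executor.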
But this approach does not work with the GATK Spark tools, such as BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark.
So, is there another way to execute these Spark tools from Java code? For example, 4.5 years ago this post suggested using org.broadinstitute.sting.gatk.CommandLineGATK, but that class is no longer available.
Moreover, is there any kind of Java API (and, if so, a tutorial) for calling your tools directly, in a similar style to the Spark API, without going through bash commands?
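To illustrate what I am hoping for, here is a minimal sketch of an in-process invocation. The entry-point class org.broadinstitute.hellbender.Main is my assumption based on the GATK4 sources (I have not verified it is the supported way), the --input/--reference/--output flags are the usual GATK4 ones, and all file paths are placeholders:

```java
public class RunSparkTool {
    // Assemble the argument vector for one GATK4 Spark tool run.
    public static String[] buildArgs(String bam, String ref, String out) {
        return new String[] {
            "BwaAndMarkDuplicatesPipelineSpark",
            "--input", bam,        // aligned/unmapped input BAM
            "--reference", ref,    // reference FASTA
            "--output", out        // duplicate-marked output BAM
        };
    }

    public static void main(String[] args) {
        String[] toolArgs = buildArgs("sample.unmapped.bam", "ref.fasta", "sample.md.bam");
        // org.broadinstitute.hellbender.Main.main(toolArgs);  // requires the GATK jar on the classpath
        System.out.println(String.join(" ", toolArgs));
    }
}
```

Something like this, callable from my own driver code, is what I would like to do for each sample instead of shelling out.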
Thanks for your time; I hope my questions are clear.