
Error in ReadsPipelineSpark 4.1.4 when using -L interval list option

Hi all.
I am still trying to tune the pipeline in our environment.

When I add the -L option to ReadsPipelineSpark, I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 25, cloudera02.opbg.dom, executor 2): java.lang.IllegalArgumentException: Contig chr1 not present in reads sequence dictionary
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.convertSimpleIntervalToQueryInterval(BoundedTraversalUtil.java:73)
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.lambda$prepareQueryIntervals$0(BoundedTraversalUtil.java:46)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.prepareQueryIntervals(BoundedTraversalUtil.java:47)
at org.disq_bio.disq.impl.formats.sam.AbstractBinarySamSource.lambda$getReads$c0b65654$1(AbstractBinarySamSource.java:128)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun
...

The tool works well when I call it WITHOUT the -L interval list option.

This is the command I run:

nohup /opt/gatk/gatk-4.1.4.0/gatk ReadsPipelineSpark \
    --spark-runner SPARK --spark-master yarn --spark-submit-command spark2-submit \
    -I hdfs://cloudera08/gatk-test2/WES2019-023_S6_reheader.bam \
    -O hdfs://cloudera08/gatk-test2/WES2019-023_S6_out.g.vcf \
    -R hdfs://cloudera08/gatk-test1/ucsc.hg19.fasta \
    -L hdfs://cloudera08/gatk-test2/RefGene_exons.bed \
    --dbsnp hdfs://cloudera08/gatk-test1/dbsnp_150_hg19.vcf.gz \
    --known-sites hdfs://cloudera08/gatk-test1/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz \
    --align true \
    --emit-ref-confidence GVCF \
    --standard-min-confidence-threshold-for-calling 100.0 \
    --conf deploy-mode=cluster \
    --conf "spark.driver.memory=2g" \
    --conf "spark.executor.memory=18g" \
    --conf "spark.storage.memoryFraction=1" \
    --conf "spark.akka.frameSize=200" \
    --conf "spark.default.parallelism=100" \
    --conf "spark.core.connection.ack.wait.timeout=600" \
    --conf "spark.yarn.executor.memoryOverhead=4096" \
    --conf "spark.yarn.driver.memoryOverhead=400" \
    > WES2019-023_S6.out &

My input BAM comes from the FastqToSam conversion tool (a sketch of the conversion follows the header below), and its header is:

$ ../samtools-1.7/samtools view -H WES2019-023_S6.bam
@HD VN:1.6 SO:queryname
@RG ID:WES2019-023 SM:WES2019-023_S6 PL:illumina PU:L1
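For completeness, the conversion was done roughly like this (a sketch; the FASTQ file names are illustrative, and the read-group values are the ones visible in the header above):

/opt/gatk/gatk-4.1.4.0/gatk FastqToSam \
    -F1 WES2019-023_S6_R1.fastq.gz \
    -F2 WES2019-023_S6_R2.fastq.gz \
    -O WES2019-023_S6.bam \
    -SM WES2019-023_S6 \
    -RG WES2019-023 \
    -PL illumina \
    -PU L1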

Since this header contains no @SQ lines, I tried to reheader the BAM, adding @SQ records built from the reference FASTA, but I got the same error.
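For reference, the reheadering attempt looked roughly like this (a sketch; the exact invocation may have differed):

# build a new header: keep the existing @HD/@RG lines and append @SQ lines
# generated from the reference FASTA, then apply it to the BAM
../samtools-1.7/samtools view -H WES2019-023_S6.bam > new_header.sam
../samtools-1.7/samtools dict ucsc.hg19.fasta | grep '^@SQ' >> new_header.sam
../samtools-1.7/samtools reheader new_header.sam WES2019-023_S6.bam > WES2019-023_S6_reheader.bam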

I have been struggling with this problem for some days now, so could you please help me figure out how to move forward?
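In case it helps, this is roughly how I would compare the contig names (e.g. chr1 vs 1) across the inputs; the ucsc.hg19.dict file name is an assumption for the sequence dictionary sitting next to the FASTA:

# contig names in the reheadered BAM
../samtools-1.7/samtools view -H WES2019-023_S6_reheader.bam | grep '^@SQ' | head -3
# contig names in the interval list
cut -f1 RefGene_exons.bed | sort -u | head -3
# contig names in the reference sequence dictionary
grep '^@SQ' ucsc.hg19.dict | head -3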

Thanks a lot in advance.
Alessandro
