BQSRPipelineSpark can't run with joinStrategy SHUFFLE

liucheng · China · Member
edited July 2017 in Ask the GATK team

I tried to process data with BQSRPipelineSpark (the latest released GATK4 beta version), but it doesn't work unless the data size is small. To illustrate, I ran experiments on data extracted from ERR000589. The run succeeds when there are only 10,000 SAM records (a 2 MB SAM file), but fails when there are more than 10,000 records (a 20 MB file). I tried increasing the memory of each executor from 30 GB to 60 GB, but the program still fails.

knownSites uses dbsnp_138.hg19.vcf (10 GB).
The reference is ucsc.hg19.2bit (0.8 GB).
It runs on Spark 2.0 with 4 workers in total; each node has 16 physical cores and 64 GB of memory.

Below is my command.
./gatk-launch BQSRPipelineSpark -I hdfs:///user/liucheng/ERR000589.bwa.mark.bam -O hdfs:///user/liucheng/ERR000589.bwa.mark.bqsr.bam -R hdfs:///user/liucheng/refs/ucsc.hg19.fasta --knownSites hdfs:///user/liucheng/dbsnp/dbsnp_138.hg19.vcf -joinStrategy SHUFFLE -- --sparkRunner SPARK --sparkMaster spark://cu11:7077 --total-executor-cores 48 --executor-cores 6 --executor-memory 25G --driver-memory 30G

The log is as follows:

[July 19, 2017 2:45:18 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark done. Elapsed time: 3.57 minutes.
Runtime.totalMemory()=1559232512
org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 5.0 failed 4 times, most recent failure: Lost task 25.3 in stage 5.0 (TID 418, 192.168.0.10, executor 1): htsjdk.samtools.SAMException: Unable to load chr13(72100194, 72110026) from /user/liucheng/refs/ucsc.hg19.fasta
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:247)
    at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:33)
    at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
    at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases.lambda$addBases$cff38836$1(ShuffleJoinReadsWithRefBases.java:123)
    at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases$$Lambda$78/1542874140.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
    at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$89/403397455.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761762_1022478 file=/user/liucheng/refs/ucsc.hg19.fasta
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
    at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.readFromPosition(IndexedFastaSequenceFile.java:292)
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:244)
    ... 32 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
    at org.apache.spark.api.java.JavaRDDLike$class.treeAggregate(JavaRDDLike.scala:439)
    at org.apache.spark.api.java.AbstractJavaRDDLike.treeAggregate(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.apply(BaseRecalibratorSparkFn.java:39)
    at org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark.runTool(BQSRPipelineSpark.java:110)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:230)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: htsjdk.samtools.SAMException: Unable to load chr13(72100194, 72110026) from /user/liucheng/refs/ucsc.hg19.fasta
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:247)
    at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:33)
    at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
    at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases.lambda$addBases$cff38836$1(ShuffleJoinReadsWithRefBases.java:123)
    at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases$$Lambda$78/1542874140.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
    at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$89/403397455.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761762_1022478 file=/user/liucheng/refs/ucsc.hg19.fasta
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
    at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.readFromPosition(IndexedFastaSequenceFile.java:292)
    at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:244)
    ... 32 more
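
For what it's worth, the failure bottoms out in an HDFS BlockMissingException on the reference FASTA, so the copy of ucsc.hg19.fasta in HDFS may be corrupt or under-replicated. A quick health check with standard HDFS tooling would be (a sketch, using the path from my command):

hdfs fsck /user/liucheng/refs/ucsc.hg19.fasta -files -blocks -locations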

Issue · GitHub #2318 (filed by shlee; closed by vdauwera)

Answers

  • shlee · Cambridge · Member, Broadie ✭✭✭✭✭

    @liucheng, I'll see what our developers have to say.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    For the record, the devs made a ticket to remove that option to avoid anyone else wasting their time on it. Thanks for reporting this, @liucheng.

  • liucheng · China · Member
    edited July 2017

    Thank you for your prompt reply, @shlee @Geraldine_VdAuwera.
    In fact, I've tried both OVERLAPS_PARTITIONER and BROADCAST. Things went well with OVERLAPS_PARTITIONER, but BROADCAST didn't work, even with a data size as small as 100 records.

    I use ERR000589; the BAM file is 1.3 GB.
    knownSites uses dbsnp_138.hg19.vcf (10 GB).
    The reference is ucsc.hg19.2bit (0.8 GB).
    It runs on Spark 2.0 with 4 workers in total; each node has 16 physical cores and 64 GB of memory.

    Below is my command (note there is no -joinStrategy flag here, so the default BROADCAST strategy applies).
    ./gatk-launch BQSRPipelineSpark -I hdfs:///user/xxx/ERR000589.bwa.mark.bam -O hdfs:///user/xxx/ERR000589.bwa.marked.bqsr.bam -R hdfs:///user/liucheng/refs/ucsc.hg19.2bit --knownSites hdfs:///user/liucheng/dbsnp/dbsnp_138.hg19.vcf -- --sparkRunner SPARK --sparkMaster spark://cu11:7077 --total-executor-cores 48 --executor-cores 6 --executor-memory 25G --driver-memory 30G

    The log is as follows:
    [July 19, 2017 2:39:55 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark done. Elapsed time: 3.24 minutes.
    Runtime.totalMemory()=23515365376
    com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
    Serialization trace:
    vs (org.broadinstitute.hellbender.utils.collections.IntervalsSkipListOneContig)
    intervals (org.broadinstitute.hellbender.utils.collections.IntervalsSkipList)
        at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
        at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:606)
        at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:109)
        at com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:39)
        at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
        at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:269)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1411)
        at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:650)
        at org.broadinstitute.hellbender.engine.spark.BroadcastJoinReadsWithVariants.join(BroadcastJoinReadsWithVariants.java:27)
        at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark.add(AddContextDataToReadSpark.java:68)
        at org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark.runTool(BQSRPipelineSpark.java:108)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.NegativeArraySizeException
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:447)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:245)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:239)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:135)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:246)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:239)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:135)
        at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41)
        at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:658)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:623)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
        at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
        at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
        at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
        ... 39 more
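
    Here Kryo throws the NegativeArraySizeException while serializing the broadcast of the known-sites data (the IntervalsSkipList built from the 10 GB dbsnp VCF), which suggests the broadcast object is too big for Kryo to handle. One thing that might be worth trying (a sketch only; I have not confirmed it makes BROADCAST mode work, and it assumes gatk-launch forwards --conf to spark-submit like the other Spark options above) is raising the Kryo buffer limit:

        ./gatk-launch BQSRPipelineSpark ... -- --sparkRunner SPARK --sparkMaster spark://cu11:7077 --conf spark.kryoserializer.buffer.max=1g ...   # assumes --conf is passed through to spark-submit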

  • liucheng · China · Member

    Finally, I'd like to know whether there is a suggested minimum hardware requirement for running the GATK4 Best Practices pipelines, and whether my current compute cluster is capable of doing all this work (a rough sizing of my current settings follows the specs below).

    CPU: 2 x 8 physical cores per node
    nodes: 4
    network: GbE
    memory: 64 GB per node
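
    As a rough back-of-the-envelope check of the executor settings from my command against these specs (all numbers are from this thread; this is a sketch, not an official requirement):

        total cores  : 4 nodes x 16 cores = 64 cores
        executors    : --total-executor-cores 48 / --executor-cores 6 = 8 executors
        executor RAM : 8 executors x 25 GB = 200 GB
        cluster RAM  : 4 nodes x 64 GB = 256 GB
        headroom     : 256 GB - 200 GB - 30 GB (driver) = 26 GB left for the OS, HDFS, and Spark overhead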

    Issue · GitHub #2324 (filed by shlee; closed by vdauwera)
  • liucheng · China · Member
    edited July 2017

    @shlee I saw a video on YouTube of BQSRPipelineSpark running successfully in broadcast mode. Does this mean that broadcast mode can only run on GCS? Also, I visited the GitHub page you provided several days ago and tried modifying the Spark conf, but the exceptions still appear.
    For overlaps mode, although I ran BQSRPipelineSpark successfully with it, ReadsPipelineSpark failed, and the exception was very similar to the one from running BQSRPipelineSpark in shuffle mode.
    I would be grateful if you could tell me why these failures occur.

    Thanks again for your patience.

    21:29:07.827 INFO  ReadsPipelineSpark - Shutting down engine
    [July 21, 2017 9:29:07 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.ReadsPipelineSpark done. Elapsed time: 13.87 minutes.
    Runtime.totalMemory()=1703411712
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 874, 192.168.0.8, executor 0): htsjdk.samtools.SAMException: Could not open sequence dictionary file: /user/liucheng/refs/ucsc.hg19.dict
        at htsjdk.samtools.reference.AbstractFastaSequenceFile.<init>(AbstractFastaSequenceFile.java:76)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:90)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:111)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:123)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:106)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
        at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:32)
        at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
        at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark$1.call(AddContextDataToReadSpark.java:121)
        at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark$1.call(AddContextDataToReadSpark.java:114)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$203/1698715390.call(Unknown Source)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: htsjdk.samtools.util.RuntimeIOException: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761745_1022461 file=/user/liucheng/refs/ucsc.hg19.dict
        at htsjdk.samtools.util.BufferedLineReader.peek(BufferedLineReader.java:98)
        at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:134)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
        at htsjdk.samtools.reference.AbstractFastaSequenceFile.<init>(AbstractFastaSequenceFile.java:68)
        ... 36 more
    Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761745_1022461 file=/user/liucheng/refs/ucsc.hg19.dict
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
        at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at htsjdk.samtools.util.BufferedLineReader.peek(BufferedLineReader.java:96)
        ... 39 more
    
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
        at org.apache.spark.api.java.JavaRDDLike$class.treeAggregate(JavaRDDLike.scala:439)
        at org.apache.spark.api.java.AbstractJavaRDDLike.treeAggregate(JavaRDDLike.scala:45)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.apply(BaseRecalibratorSparkFn.java:39)
        at org.broadinstitute.hellbender.tools.spark.pipelines.ReadsPipelineSpark.runTool(ReadsPipelineSpark.java:133)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: htsjdk.samtools.SAMException: Could not open sequence dictionary file: /user/liucheng/refs/ucsc.hg19.dict
        at htsjdk.samtools.reference.AbstractFastaSequenceFile.<init>(AbstractFastaSequenceFile.java:76)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:90)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:111)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:123)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:106)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
        at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:32)
        at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
        at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark$1.call(AddContextDataToReadSpark.java:121)
        at org.broadinstitute.hellbender.engine.spark.AddContextDataToReadSpark$1.call(AddContextDataToReadSpark.java:114)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
        at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$203/1698715390.call(Unknown Source)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: htsjdk.samtools.util.RuntimeIOException: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761745_1022461 file=/user/liucheng/refs/ucsc.hg19.dict
        at htsjdk.samtools.util.BufferedLineReader.peek(BufferedLineReader.java:98)
        at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:134)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
        at htsjdk.samtools.reference.AbstractFastaSequenceFile.<init>(AbstractFastaSequenceFile.java:68)
        ... 36 more
    Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761745_1022461 file=/user/liucheng/refs/ucsc.hg19.dict
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
        at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at htsjdk.samtools.util.BufferedLineReader.peek(BufferedLineReader.java:96)
        ... 39 more
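
    Both failures again bottom out in an HDFS BlockMissingException, this time on ucsc.hg19.dict, so the reference files in HDFS may be under-replicated or sitting on a flaky datanode. A check-and-repair sketch using standard HDFS commands (the paths are the ones from my command):

        hdfs fsck /user/liucheng/refs -files -blocks -locations
        hdfs dfs -setrep -w 3 /user/liucheng/refs/ucsc.hg19.dict   # raise replication and wait for it to finish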
    

    Issue · GitHub #2372 (filed by Sheila; closed by vdauwera)
  • Sheila · Broad Institute · Member, Broadie admin

    @liucheng
    Hi,

    Sorry for the delay. I will check with the team and get back to you.

    -Sheila
