

Running GATK WDL on FireCloud with TCGA controlled bam files

Hi, GATK team!

I have an issue with the GATK 4.0 pipeline when running an analysis on FireCloud.

I want to run GATK on TCGA controlled-access mRNA-Seq BAM files. As far as I can tell, FireCloud offers the TCGA level 1 BAM files, which are named like .sorted_genome_alignments.bam. So I ran the pipeline starting from the MarkDuplicates step, following rnaseq-germline-snps-indels, the public WDL example the GATK team published on FireCloud. I set the appropriate parameters and workspace attributes in the configuration, in particular the reference FASTA gs://broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta, but I got an error like this:

[Mon Jun 11 04:59:58 UTC 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 18.94 minutes.
Runtime.totalMemory()=24761073664
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: This program requires input that are either coordinate or query sorted. Found unsorted
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:254)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar MarkDuplicates --INPUT /cromwell_root/5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/LUAD/RNA/RNA-Seq/UNC-LCCC/ILLUMINA/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.metrics
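A side note on the "Found unsorted" message above: Picard goes by the SO tag in the BAM's @HD header line, not by the file name, so a file named sorted_genome_alignments.bam can still be declared unsorted. A minimal sketch of that check (the header string below is made up for illustration; in practice you would inspect the real header with `samtools view -H`):

```python
def sam_sort_order(header_text):
    """Return the SO: value from the @HD header line, or 'unknown' if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"

# A header like this triggers "Found unsorted" in MarkDuplicates even if
# the file name contains "sorted":
header = "@HD\tVN:1.5\tSO:unsorted\n@SQ\tSN:chr1\tLN:249250621"
print(sam_sort_order(header))  # unsorted
```

Checking the declared sort order first tells you whether adding --ASSUME_SORTED is papering over a real problem.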

Since the name of the input BAM file includes "sorted", I thought it was reasonable to add the --ASSUME_SORTED option after searching solutions that other users and GATK staff had posted. The MarkDuplicates step then finally worked, but the next step, SplitNCigarReads, failed with this error:

INFO  12:59:01,086 HelpFormatter - Program Args: -T SplitNCigarReads -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta -I /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam -o UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS 
INFO  12:59:01,089 HelpFormatter - Executing as [email protected] on Linux 4.9.0-0.bpo.6-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2~bpo8+1-b14. 
INFO  12:59:01,089 HelpFormatter - Date/Time: 2018/06/19 12:59:01 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,234 GenomeAnalysisEngine - Strictness is SILENT 
INFO  12:59:01,292 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  12:59:01,298 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
WARNING: BAM index file /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bai is older than BAM /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam
INFO  12:59:01,319 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  12:59:02,073 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM_rCRS, chrX, chrY]
##### ERROR ------------------------------------------------------------------------------------------
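The contig list in that error is in lexicographic (string) order rather than GATK's expected karyotypic order: string-sorting the contig names reproduces the reported list exactly, which is a quick way to confirm the diagnosis (a small check, not part of the pipeline):

```python
# Contigs of a UCSC-style hg19 reference with an rCRS mitochondrial contig:
contigs = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY", "chrM_rCRS"]

# The exact order reported in the SplitNCigarReads error message:
reported = (
    ["chr1"] + ["chr1%d" % i for i in range(0, 10)]
    + ["chr2"] + ["chr2%d" % i for i in range(0, 3)]
    + ["chr%d" % i for i in range(3, 10)]
    + ["chrM_rCRS", "chrX", "chrY"]
)

# Plain string sort (lexicographic) reproduces it: chr10 sorts before chr2.
assert sorted(contigs) == reported
print(sorted(contigs)[:4])  # ['chr1', 'chr10', 'chr11', 'chr12']
```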

I tried to fix this by adding a ReorderSam step, as the error message suggested. The reference FASTA I used was still Homo_sapiens_assembly19_1000genomes_decoy.fasta, but it still failed. The error message was:

[Mon Jun 11 15:30:14 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=665845760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: New reference sequence does not contain a matching contig for chr1
    at picard.sam.ReorderSam.buildSequenceDictionaryMap(ReorderSam.java:263)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:146)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/64c3a79b-04f8-46f8-a238-8717380c7768/RNAseq/4e8ce380-6f4b-41f6-b1d2-4fe11ed8fa68/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true

I then added the options --VALIDATION_STRINGENCY LENIENT and --ALLOW_INCOMPLETE_DICT_CONCORDANCE, and got this:

Ignoring SAM validation error: ERROR: Record 178984837, Read name UNC9-SN296_246:4:1107:4151:192010/2, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 178984905, Read name UNC9-SN296_246:4:2205:17136:94561/2, Mapped mate should have mate reference name
INFO 2018-06-13 03:23:44 ReorderSam Wrote 186956859 reads
[Wed Jun 13 03:23:46 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 40.71 minutes.
Runtime.totalMemory()=10954997760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:226)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyClose(AsyncSAMFileWriter.java:38)
    at htsjdk.samtools.util.AbstractAsyncWriter.close(AbstractAsyncWriter.java:89)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:167)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:119)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:137)
    ... 9 more
Caused by: htsjdk.samtools.SAMException: Unexpected reference -1 when constructing index for 0 for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:218)
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:117)
    ... 10 more
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/4cc0531d-5121-4140-8344-f38235f035fd/RNAseq/f7b4b882-effb-43ed-a70a-76720b2d8772/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --ALLOW_INCOMPLETE_DICT_CONCORDANCE
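The "Unexpected reference -1" failure is consistent with what --ALLOW_INCOMPLETE_DICT_CONCORDANCE does: reads on contigs missing from the new reference dictionary get reference index -1 (the "no reference" value), and the BAM indexer then rejects records that claim to be aligned but point at reference -1. A toy illustration (the dictionary below is a made-up subset, not the real sequence dictionary):

```python
# Made-up subset of a b37-style sequence dictionary (contig name -> index):
new_dictionary = {"1": 0, "2": 1, "3": 2, "X": 22, "Y": 23, "MT": 24}

def new_reference_index(contig):
    # Contigs absent from the new dictionary fall through to -1,
    # mirroring how an unmatched contig ends up with no reference index.
    return new_dictionary.get(contig, -1)

print(new_reference_index("1"))     # 0
print(new_reference_index("chr1"))  # -1: an aligned read carrying this
                                    # index breaks BAM index construction
```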

It seems the input sorted_genome_alignments BAM was aligned to a reference FASTA different from the one my pipeline uses. I looked through the metadata TCGA provides on their official website and the description of the TCGA controlled-access workspace on FireCloud, but I couldn't find which specific reference FASTA was used.
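Concretely, the contig naming conventions disagree: the TCGA/UNC BAM uses UCSC-style names (chr1, ..., chrM_rCRS), while the b37-based Homo_sapiens_assembly19 reference uses plain names (1, ..., MT), which is why ReorderSam found no matching contig for chr1. A hypothetical rename helper makes the mismatch visible (actually converting a BAM this way would also require rewriting the header dictionary, e.g. with samtools reheader, and verifying the sequences themselves match):

```python
def to_b37_name(ucsc_name):
    """Map a UCSC-style contig name to its b37-style counterpart."""
    if ucsc_name in ("chrM", "chrM_rCRS"):
        return "MT"
    if ucsc_name.startswith("chr"):
        return ucsc_name[3:]
    return ucsc_name

print(to_b37_name("chr1"))       # 1
print(to_b37_name("chrM_rCRS"))  # MT
```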

Could you please provide some help to solve the problem?


Answers

  • Sheila (Broad Institute, Moderator)

    @lvqi_0172
    Hi,

    It seems like your BAM file is malformed. Can you try running ValidateSamFile on it?

    -Sheila

  • @Sheila
    Hi,
Thank you for the reply. I tried running ValidateSamFile, but it failed without an error message. The WDL log looked like this:

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.ed51ba28
    03:25:16.997 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Mon Jun 25 03:25:17 UTC 2018] ValidateSamFile  --INPUT /cromwell_root/fc-84b9e4b6-909b-4ac4-bc7a-ecb6d856bf3f/2b3b5404-126e-4200-ae7a-d4c00e8ef498/RNAseq/d33fa329-2c94-42ff-b34e-643ce870ba95/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.validated.bam --MODE SUMMARY  --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    [Mon Jun 25 03:25:17 UTC 2018] Executing as [email protected] on Linux 4.9.0-0.bpo.6-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11; Deflater: Intel; Inflater: Intel; Picard version: Version:4.0.3.0
    INFO    2018-06-25 03:26:09 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:00:52s.  Time for last 10,000,000:   52s.  Last read position: chr1:148,347,144
    INFO    2018-06-25 03:27:03 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:01:46s.  Time for last 10,000,000:   53s.  Last read position: chr1:214,478,560
    INFO    2018-06-25 03:28:02 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:02:45s.  Time for last 10,000,000:   58s.  Last read position: chr11:16,996,437
    INFO    2018-06-25 03:28:56 SamFileValidator    Validated Read    40,000,000 records.  Elapsed time: 00:03:39s.  Time for last 10,000,000:   54s.  Last read position: chr12:21,670,704
    INFO    2018-06-25 03:29:51 SamFileValidator    Validated Read    50,000,000 records.  Elapsed time: 00:04:34s.  Time for last 10,000,000:   54s.  Last read position: chr14:24,095,115
    INFO    2018-06-25 03:30:41 SamFileValidator    Validated Read    60,000,000 records.  Elapsed time: 00:05:24s.  Time for last 10,000,000:   49s.  Last read position: chr16:619,140
    INFO    2018-06-25 03:31:28 SamFileValidator    Validated Read    70,000,000 records.  Elapsed time: 00:06:10s.  Time for last 10,000,000:   46s.  Last read position: chr17:33,478,163
    INFO    2018-06-25 03:32:21 SamFileValidator    Validated Read    80,000,000 records.  Elapsed time: 00:07:04s.  Time for last 10,000,000:   53s.  Last read position: chr19:682,815
    INFO    2018-06-25 03:33:15 SamFileValidator    Validated Read    90,000,000 records.  Elapsed time: 00:07:58s.  Time for last 10,000,000:   53s.  Last read position: chr2:3,627,729
    INFO    2018-06-25 03:34:11 SamFileValidator    Validated Read   100,000,000 records.  Elapsed time: 00:08:54s.  Time for last 10,000,000:   56s.  Last read position: chr2:175,585,069
    INFO    2018-06-25 03:35:08 SamFileValidator    Validated Read   110,000,000 records.  Elapsed time: 00:09:51s.  Time for last 10,000,000:   56s.  Last read position: chr22:25,791,798
    INFO    2018-06-25 03:35:58 SamFileValidator    Validated Read   120,000,000 records.  Elapsed time: 00:10:41s.  Time for last 10,000,000:   50s.  Last read position: chr4:24,801,241
    INFO    2018-06-25 03:36:45 SamFileValidator    Validated Read   130,000,000 records.  Elapsed time: 00:11:28s.  Time for last 10,000,000:   46s.  Last read position: chr5:135,399,186
    INFO    2018-06-25 03:37:37 SamFileValidator    Validated Read   140,000,000 records.  Elapsed time: 00:12:20s.  Time for last 10,000,000:   51s.  Last read position: chr6:42,980,760
    INFO    2018-06-25 03:38:28 SamFileValidator    Validated Read   150,000,000 records.  Elapsed time: 00:13:11s.  Time for last 10,000,000:   50s.  Last read position: chr7:65,751,634
    INFO    2018-06-25 03:39:18 SamFileValidator    Validated Read   160,000,000 records.  Elapsed time: 00:14:01s.  Time for last 10,000,000:   50s.  Last read position: chr8:134,262,716
    INFO    2018-06-25 03:40:06 SamFileValidator    Validated Read   170,000,000 records.  Elapsed time: 00:14:49s.  Time for last 10,000,000:   47s.  Last read position: chrM_rCRS:7,767
    INFO    2018-06-25 03:40:53 SamFileValidator    Validated Read   180,000,000 records.  Elapsed time: 00:15:36s.  Time for last 10,000,000:   46s.  Last read position: */*
    Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -XX:GCHeapFreeLimit=10 -Xms8000m -Xmx50G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ValidateSamFile --INPUT /cromwell_root/fc-84b9e4b6-909b-4ac4-bc7a-ecb6d856bf3f/2b3b5404-126e-4200-ae7a-d4c00e8ef498/RNAseq/d33fa329-2c94-42ff-b34e-643ce870ba95/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.validated.bam --MODE SUMMARY
    

I'm confused by this. Are there any other parameters I should add?

  • Sheila (Broad Institute, Moderator)

    @lvqi_0172
    Hi,

    That is not a failure. Those statements are just telling you some information about what the tool is doing. Can you try running it on your laptop without WDL?

    Thanks,
    Sheila

  • lvqi_0172 (edited July 2018)

    @Sheila
    Hi,

I tried running ValidateSamFile both on FireCloud and on my local server with UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.bam. Both failed with "GC overhead limit exceeded", even after I increased the memory to 80G for this step.
Then I tried the MarkDuplicates task again with the input BAM file. When I set the command as

    command <<<
        ${gatk_path} --java-options "-Xmx32G" \
            MarkDuplicates \
            --INPUT ${input_bam} \
            --OUTPUT ${base_name}.bam \
            --CREATE_INDEX true \
            --VALIDATION_STRINGENCY LENIENT \
            --METRICS_FILE ${base_name}.metrics \
            --ASSUME_SORTED true
    >>>

the WDL log contained many lines like

    Ignoring SAM validation error: ERROR: Record 178984594, Read name UNC9-SN296_246:4:2104:8295:128840/2, Mapped mate should have mate reference name 
    Ignoring SAM validation error: ERROR: Record 178984612, Read name UNC9-SN296_246:4:1205:16305:36126/1, Mapped mate should have mate reference name 
    Ignoring SAM validation error: ERROR: Record 178984621, Read name UNC9-SN296_246:4:1203:7070:134149/2, Mapped mate should have mate reference name 
    Ignoring SAM validation error: ERROR: Record 178984662, Read name UNC9-SN296_246:4:1104:14888:193241/1, Mapped mate should have mate reference name 
    Ignoring SAM validation error: ERROR: Record 178984768, Read name UNC9-SN296_246:4:1107:12715:86028/1, Mapped mate should have mate reference name
    
    

and then it stalled for a long time with no further output, so I had to abort the submission.

So I wonder: is there something wrong with the original TCGA alignment BAM files? How should I handle these files?

    Thanks,
    Lv Qi

  • shlee (Cambridge, Broadie; edited July 2018)

    Hi @lvqi_0172,

    Your read names indicate they were preprocessed incorrectly.

    UNC9-SN296_246:4:2104:8295:128840/2
    UNC9-SN296_246:4:1205:16305:36126/1

The /1 and /2 notations are typically stripped during preprocessing (at alignment) and denote the mates of a read pair. Article #6484 illustrates these notations, and I believe its Comments section further discusses how to deal with them when things go wrong.

Do you know which aligner, and which version of it, this data was preprocessed with? You should be able to pull this information from the alignment file header (grep '@PG'). Either BWA or STAR should have handled these correctly, assuming the paired read files were provided to the aligner correctly. If your SAM flag values and other alignment metadata tags are correct, then you could potentially strip these suffixes. Otherwise, you may want to consider reverting and realigning the reads.
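For illustration, stripping the legacy /1 and /2 suffixes could look like the sketch below (a hypothetical helper, not a GATK tool; a real fix would stream every record, e.g. with samtools or pysam, rewriting each QNAME, and is only safe if the FLAG fields already carry the mate information):

```python
import re

# Legacy mate suffix at the very end of a read name: "/1" or "/2".
MATE_SUFFIX = re.compile(r"/[12]$")

def strip_mate_suffix(qname):
    """Drop a trailing /1 or /2 so both mates share one read name."""
    return MATE_SUFFIX.sub("", qname)

print(strip_mate_suffix("UNC9-SN296_246:4:2104:8295:128840/2"))
# UNC9-SN296_246:4:2104:8295:128840
```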
