StructuralVariationDiscoveryPipelineSpark

mezewudomezewudo AtlantaMember

Hello:
I run into the errors below, when I try to run the StructuralVariationDiscoveryPipelineSpark to discover structural variants in a given sample:

ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 50)
java.lang.IllegalArgumentException: provided start is negative: -36

I am not sure why the tool throws these errors when everything else seems fine. I would appreciate it if you could shed some light on this.

Answers

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @mezewudo

    Can you please provide some more details?

    What version of GATK, what type of data are you using as inputs and how were they generated?

    Also a screenshot of the command that was run and the error log would be very helpful.

  • mezewudomezewudo AtlantaMember

    I used the GATK package 4.0.12.0 version and the command I ran was this:

    ./gatk StructuralVariationDiscoveryPipelineSpark -I SRR6930898_sorted.bam -R ref2.2bit --aligner-index-image ref2.img --kmers-to-ignore kmers_to_ignore.txt --contig-sam-file contigs.sam -O SRR6930898_structural_variants.vcf > out 2>error

    So I had as input: an alignment file (a sorted and indexed BAM file), a reference file in the .2bit format, an image file for the reference genome, a list of kmers to ignore built from the reference genome.

    I have attached all the stdout messages including the error messages to this post as the error.txt file.

    Thanks for your help..

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Thanks for sending the information, I am going to ask the development team. This tool is in beta at the moment, so this could possibly be a bug.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @mezewudo

    I heard back from the development team, so let's start with the simplest options.

    1.) Is it possible that the bam file is corrupted? It looks like the error is looking at contigs with negative start positions.

    Try running the GATK tool ValidateSamFile

    2.) Second possibility lies with the reference you are using. What is the reference file? Is it from the Broad Resource Bundle? The development team recommends using the reference version without the alternate contigs included.

    Try those two options and respond when you get a chance.

  • mezewudomezewudo AtlantaMember

    So for

    1) I ran the ValidateSamFile on the bam file and it returned 'No errors found'

    2) The reference I am using is the H37Rv (NC_000962.3) reference file from NCBI. I am not sure whether it is in the Broad Resource Bundle. My analysis is not on a human genome but on a bacterial genome (Mycobacterium tuberculosis). I am wondering whether this tool is not set up to analyze non-human genomes?

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @mezewudo

    I just wanted to let you know I am looking into this question.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @mezewudo
    Did you look into the bam to see if any of the positions in the bam file were negative? How was the bam created?

    Here are the ideas from the development team:

    It looks to me like the error is in QNameFinder.java line 48, where the code is encountering reads with negative start position.

    Secondly, a possibility caused by how the program works:

    ... the code is trying to construct an interval where the start is the read’s unclipped start. if the read aligns to the very beginning of a contig with some clipping, that could make the unclipped start negative. this would generally not happen with the primary contigs in the human reference, which all start with N’s. 
    
    ... If it’s a custom reference or a microbe or something, a workaround might be to use the `exclusion-intervals` parameter to create intervals to ignore for breakpoint detection. I’d put in the first couple hundred bases of each contig.
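
To make the point about unclipped starts concrete, here is a minimal Python sketch of the arithmetic, not GATK's actual code. It assumes the usual definition of an unclipped start: the 1-based alignment position minus any leading soft-clipped (S) or hard-clipped (H) bases in the CIGAR string. The function name `unclipped_start` and the second read are hypothetical.

```python
import re

# Matches one CIGAR element: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_start(pos, cigar):
    """Return the 1-based alignment start minus any leading S/H clips."""
    start = pos
    for length, op in _CIGAR_OP.findall(cigar):
        if op in "SH":
            start -= int(length)
        else:
            # Stop at the first operation that is not a clip.
            break
    return start


# A read aligned far from the contig start stays positive.
print(unclipped_start(893578, "33S218M"))  # 893545

# Hypothetical read aligned at position 1 with 37 soft-clipped bases.
print(unclipped_start(1, "37S214M"))  # -36
```

Any clipped read aligned within a clip length of the contig start can produce a zero or negative value this way, which is why excluding the first few hundred bases is suggested as a workaround.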
    
    
  • mezewudomezewudo AtlantaMember

    Thanks for your help. I used BWA MEM with default settings to create the alignment file and converted it to a BAM file using Picard tools. I looked at the alignment file, and some rows have negative values in field 9 (insert size) of the SAM format; I am not sure if that means anything. Below is a row from the alignment file:

    SRR6930898.1279304 147 NC_000962 893578 60 33S218M = 893337 -459 GGGCGGGAAGAACGAGAGGCACAATCAGAGGGACAAGCAGCAACCGGACAGGCTAGACGAGGGCAGGCACGTGGTGGAGCTGCAACCGTATGGGGGAGTTTGGCTGCACTCCTGGCTGGATCGCGATCTGGGCATCAGCGGGCGGCTATCGGTGCGTGACGGTACCGGGGTCAGCCACCGGCTGGTCCGGATCGACGACCCGATCCTGCGGGTGCAGCAGCTGGCGATTCACCTGGCCGAGGAACGAAAGT -----///-----9////////9///--////-/:;///--;---//9-;//9------/---//9--;-----/99///9--A9--:/;-99.00;9//.9/000.:....../00::;--:./;.<0/<00<.-@<;-<<<.1<//<//<</0/?/<C<<[email protected]//?B<>0/F>//>//>>0/>///0>>B//>>//B///AB////A/A0B011111B000A003D111B001A1111B11B1131A>1> NM:i:11 MD:Z:

    Could you also clarify the 'exclusion-intervals' option? The microbial reference genome I am using does not have multiple contigs; it is a single sequence of about 4 million bases. How would I create intervals to ignore for breakpoint detection in this scenario?

    Thanks.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @mezewudo This is a little bit odd. I am curious if you could please share the commands that you are running? I have provided this information back to the development team for more advice.

  • mezewudomezewudo AtlantaMember

    The command I ran basically is:

    ./gatk StructuralVariationDiscoveryPipelineSpark -I my_sorted.bam -R my_reference.2bit --aligner-index-image my_reference.img --kmers-to-ignore kmers_to_ignore.txt --contig-sam-file aligned_contigs.sam -O structural_variants.vcf

    I created the reference image (.img) file and the kmers_to_ignore text file using BwaMemIndexImageCreator and FindBadGenomicKmersSpark respectively, as described in the tool documentation.

    The reference genome I used was pulled from NCBI with the accession number NC_000962.3 in FASTA format, and I then converted it to a .2bit file.

    The input FASTQ files I supplied to BWA to create the BAM file were paired-end reads from Illumina sequencing.

    I really think this has to do with my sample being a microbial genome instead of say a regular human genome which the development team may have been used to for the most part.

    My suggestion is, could the development team kindly on their end, run this tool on any Mycobacterium tuberculosis publicly available sample and go from start to finish, to actually see what the differences and possible issues could be in the analysis process.

    Thanks.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    If the microbial genome is circular, there are probably reads that align over the origin of the circle (the coordinate marked 0). BWA will clip these to start at 0, so their unclipped start position will be negative. To work around this bug, you can create an exclusion interval for the first several hundred bases of the contig using exclusion-intervals. You can just do this for the one long contig. Or you can add a few hundred N's to the front of the contig and see if that works.

  • mezewudomezewudo AtlantaMember

    Alright, just to be sure, what should the exclusion interval file look like? Would it be a tab-delimited text file with coordinates for the intervals, for example:

    1 7500
    2000 35000
    7000 85000

    Or better still could you share an example of an exclusion-intervals file, so I could see the format and adapt it to the sample I want to analyze.

    Thanks.
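
For reference, GATK's engine-level `-XL` (`--exclude-intervals`) argument accepts intervals written as `contig:start-stop` (1-based, inclusive), either directly on the command line or one per line in a text file with a `.list` or `.intervals` extension. The Python sketch below writes such a file for the first few hundred bases of each contig. The helper names, the 300-base default, and the output file name are assumptions for illustration; the contig name must match the one in your reference and BAM header (`NC_000962` is taken from the alignment row above).

```python
def leading_exclusion_intervals(contig_names, n_bases=300):
    """Build 'contig:1-n_bases' strings (1-based, inclusive) for each contig."""
    return [f"{name}:1-{n_bases}" for name in contig_names]


def write_interval_list(path, intervals):
    """Write one interval per line, the layout GATK reads from .list files."""
    with open(path, "w") as handle:
        for interval in intervals:
            handle.write(interval + "\n")


# Contig name as it appears in the BAM row above; adjust to your own header.
intervals = leading_exclusion_intervals(["NC_000962"])
write_interval_list("exclude_contig_starts.list", intervals)
print(intervals)  # ['NC_000962:1-300']
```

The resulting file could then be passed as `-XL exclude_contig_starts.list`, or the single interval given inline as `-XL NC_000962:1-300`.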

  • mezewudomezewudo AtlantaMember

    Thanks, the -XL seems to have solved this issue.
