Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

more 512M chromosome problem in GATK

skenoyskenoy chinaMember
edited April 2016 in Ask the GATK team

i use GATK to deal with Wheat_survey genome which have a 3B chromosome(700+M), it can`t call snp from 512M to 700M use default paramaters, how fix it???

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Have you checked that 1) that region is callable (i.e. that genome region is not just a mess of repeats) and that 2) you have sequencing data in that region? Some parts of some genomes are difficult to sequence, especially the ends of chromosomes and the centromeric regions. Has anyone else previously successfully called variants in those areas? Can you show screenshots of those regions where you think variants should be called but aren't?

  • skenoyskenoy chinaMember
    edited April 2016

    1
    a
    2
    a
    3
    a
    it`s have more reads from 512M to 700+M

    Post edited by Geraldine_VdAuwera on
  • skenoyskenoy chinaMember

    if you can`t see pics, plz copy url to brower to open it

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    These pictures are not informative. Try to answer the questions and show screenshots of a genome browser like IGV. Otherwise we cannot help you.

  • 570932004570932004 ChinaMember

    We've been having the same problem.
    An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .

    Here is what the error looks like:
    INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
    INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
    INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
    INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
    INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
    INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m

    ERROR ------------------------------------------------------------------------------------------
    ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
    ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
    ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
    ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
    ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
    ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
    ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
    ERROR
    ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
    ERROR ------------------------------------------------------------------------------------------

    If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.

    Issue · Github
    by Sheila

    Issue Number
    1264
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin
    Hmm, this could be due to some data structure with a predetermined size that's insufficient. GATK was designed for and is tested on exclusively human genome data so the original devs wouldn't have imagined we'd need to deal with such large chromosomes. Would you be able to provide us with a test case that we can use to troubleshoot this locally? @Sheila can provide you with instructions.
  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @570932004
    Hi,

    Instructions for uploading test data are here.

    -Sheila

  • 570932004570932004 ChinaMember

    Hi, thanks for your reply.
    I just read from picard FAQ -- "The BAM index format imposes a limit of 512MB for a single reference sequence."
    Not sure how I missed this important information before but it sure seems to be pointing the problem away from GATK itself...
    Should I proceed with reproducing the error anyway?

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @570932004
    Hi,

    Yes, please submit a bug report. We are interested in improving the tools' capabilities when it comes to large references and large contigs.

    Thanks,
    Sheila

  • 570932004570932004 ChinaMember
    edited September 2016

    Hi, after performing some tests, I'd like to give an update to the problem.

    It seems that the variant-calling step itself works just fine --- I tested HC(both normal and GVCF modes) and UG, and was able to finish the process without any errors.
    However, certain pre-processing steps failed:
    1. Picard Tools SortSam/Markduplicates fails to index bam files if reads were aligned with a start position > 512M. -- This is a known issue from FAQ, which throws a java exception "java.lang.ArrayIndexOutOfBoundsException". Fortunately, CREATE_INDEX is false by default, and we can create the bam index with samtools.
    2. Similar failures(I guess...) were observed when using GATK's PrintReads(That's right, I had a hard time trying to create snippet files! ) and IndelRealigner tool (Haven't had the time to test other programs, sorry) .The program always stops at around 512M(536,870,912)-th base of the large chromosome, and the error message is not as meaningful as what was seen from Picard Tools. Again, we can use --disable_bam_indexing to disable bam indexing and use samtools instead. Also, according to the latest Best Practices, it seems we can skip IndelRealigner altogether when using HG, which is nice :)
    So I guess the 512M limitation is due to some data structure in the java programs. I'm surprised that the variant-calling step wasn't affected. I will try to perform more tests with my current work-around and submit a bug report as soon as I can.

    Thank you.

  • 570932004570932004 ChinaMember

    Hi,
    Bug report has been uploaded with file name Tae-gatk-debug.tar.gz .
    Thank you

    Issue · Github
    by Sheila

    Issue Number
    1313
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • 570932004570932004 ChinaMember

    Hi,
    After having more fun with PrintReads / RealignerTargetCreator / IndelRealigner , I found that one or more(hopefully not all) of these tools could not properly handle reads that were mapped beyond 512M. All three programs can finish their processes without any errors, but the output is incorrect.

    The snippet files I mentioned above were created using PrintReads with interval option "-L 3B:520000000-540000000" , but no reads were extracted with alignment positions beyond 536,6,*** or 536,7,*** (The last alignment record was different for the two files, which is strange). I can confirm that reads were mapped all the way to 3B:773,, (Full length of chromosome 3B is 774,434,471) in the original BAM files. I tried other intervals, and found that if the start position was somewhat beyond 512M (e.g, 540,000,000-550,000,000 or 770,000,000-773,000,000 where I know there are aligned reads), the resulting BAM output would contain only a BAM header.

    I then ran RealignerTargetCreator and IndelRealigner on the original BAM files. According to Best-practices, this is no longer required when using HC, but since I wanted to test both HC and UG, I performed realignment anyway. I experienced a strange behaviour with RealignerTargetCreator : Despite completing without errors, the output interval file ends at around 512M (3B:536841142-536841143 and 3B:536873793-536873804, intervals from full BAM files are consistent with the snippet). I then used these interval files to run IndelRealigner, which also completed without errors. IndelRealigner was able to output all reads regardless of alignment position, but given so many errors encountered I'm not so confident with the results.

    Using the realigned BAM, I was able to call variants with both HC and UG. Though I'm not familiar with how these different tools handle BAM / BAM-index files , this is probably worth looking into.

    To sum up, the work-around I proposed earlier seems error-prone and not well-validated, so I'd be very cautious with the variant results produced. I guess currently the safest way is to split the long reference as recommended in picard FAQ. I hope GATK and picard can get patches to fix this problem, and am willing to provide a larger subset of my data if neccessary.

    Thank you,
    Yue

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @570932004
    Hi Yue,

    Thanks for the files. I will have a look soon.

    -Sheila

  • 570932004570932004 ChinaMember

    Hi,
    Any chance we are getting a fix on GATK / Picard concerning this issue? It is not uncommon for plants to have large chromosomes.

  • SheilaSheila admin Broad InstituteMember, Broadie, Moderator admin

    @570932004
    Hi,

    We are looking to have a fix in GATK4.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Hi @570932004, there's some work underway right now in htsjdk and Picard to get the tools to behave better on large non-human genomes (they're using sugar pine as test case) so some performance issues at least are being addressed; and the fixes will be ported to GATK4. I'm not sure if the specific contig size you're dealing with will be addressed as well but I'll ask. Is the genome file you sent us publicly available? If not, can we share it with the other development teams?

  • 570932004570932004 ChinaMember

    @Geraldine_VdAuwera
    Thanks for the update about ongoing developments. Yes, the genome is publicly available at Ensembl Plants(link below).
    http://plants.ensembl.org/Triticum_aestivum/Info/Index
    Release-30 was used in our analysis, current is Release-33.

    -Yue

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Thanks @570932004, that is helpful. Just one question to clarify: if we use one of the files at this link: ftp://ftp.ensemblgenomes.org/pub/plants/release-33/fasta/triticum_aestivum/dna/

    Which one should we use, of the combination of dna/dna_rm/dna_sm and nonchromosomal/toplevel? We're not used to this nomenclature so it's a little confusing.

  • 570932004570932004 ChinaMember

    @Geraldine_VdAuwera
    Sorry, my bad for not giving a more detailed explanation.
    We used :
    ftp://ftp.ensemblgenomes.org/pub/plants/release-33/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna_sm.toplevel.fa.gz

    Actually nonchromosomal/toplevel for wheat is the same (and from my experience, usually they are same for other plants too). dna_sm means repeats and low complexity regions are lower-cased, while dna_rm means these regions are replaced with Ns.

    -Yue

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Thanks Yue! FYI we brought this up with some external developers who have been making performance improvements to Picard tools; the discussion is at https://github.com/broadinstitute/picard/pull/687 .

  • 570932004570932004 ChinaMember

    We've been having the same problem.
    An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .

    Here is what the error looks like:
    INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
    INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
    INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
    INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
    INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
    INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m

    ERROR ------------------------------------------------------------------------------------------
    ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
    ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
    ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
    ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
    ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
    ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
    ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
    ERROR
    ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
    ERROR ------------------------------------------------------------------------------------------

    If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.

  • 570932004570932004 ChinaMember

    We've been having the same problem.
    An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .

    Here is what the error looks like:
    INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
    INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
    INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
    INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
    INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
    INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m

    ERROR ------------------------------------------------------------------------------------------
    ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
    ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
    ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
    ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
    ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
    ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
    ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
    ERROR
    ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
    ERROR ------------------------------------------------------------------------------------------

    If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.

  • 570932004570932004 ChinaMember

    We've been having the same problem.
    An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .

    Here is what the error looks like:
    INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
    INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
    INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
    INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
    INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
    INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m

    ERROR ------------------------------------------------------------------------------------------
    ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
    ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
    ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
    ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
    ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
    ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
    ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
    ERROR
    ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
    ERROR ------------------------------------------------------------------------------------------

    If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.

  • 570932004570932004 ChinaMember

    We've been having the same problem.
    An error occurs whenever we try to run realign or call variant. Everything goes well until the tool reaches around 512M-th base of a chromosome (Usually somewhere around ChrNN : 536,6**,***) .

    Here is what the error looks like:
    INFO 14:43:34,320 ProgressMeter - Chr00:515278293 4.40060294E8 5.5 h 45.0 s 84.1% 6.6 h 62.4 m
    INFO 14:44:04,321 ProgressMeter - Chr00:518557516 4.40460298E8 5.5 h 45.0 s 84.2% 6.6 h 62.1 m
    INFO 14:44:34,322 ProgressMeter - Chr00:522748942 4.41060304E8 5.5 h 45.0 s 84.4% 6.6 h 61.6 m
    INFO 14:45:04,323 ProgressMeter - Chr00:528086698 4.41760311E8 5.5 h 45.0 s 84.5% 6.6 h 61.0 m
    INFO 14:45:34,324 ProgressMeter - Chr00:531814739 4.42260319E8 5.6 h 45.0 s 84.6% 6.6 h 60.6 m
    INFO 14:46:04,347 ProgressMeter - Chr00:536621269 4.42960327E8 5.6 h 45.0 s 84.8% 6.6 h 60.1 m

    ERROR ------------------------------------------------------------------------------------------
    ERROR A BAM/CRAM ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
    ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
    ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
    ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
    ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
    ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
    ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
    ERROR
    ERROR MESSAGE: Exception when processing alignment for BAM index ST-E00200:258:HYT3GCCXX:8:2101:22516:66267 2/2 150b aligned read.
    ERROR ------------------------------------------------------------------------------------------

    If we manually split the large chromosome into smaller contigs, the tool is able to call variants throughout the entire region, so we think there might be a bug somewhere.

Sign In or Register to comment.