Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GATK4 haplotype caller crashes on soft clipped reads at very start of a chromosome

WimSWimS Member ✭✭
edited July 2017 in Ask the GATK team

Hi,

I am trying out GATK4 Haplotype caller for 8 samples for a non model organism species with a basic reference genome. Output is gVCF.

For most of the genome intervals GATK4 Haplotype caller produces valid gVCF files.
An an interval on chromosome 4 that starts at location 0 (as defined via bed file) crashed because of soft clipped reads.

The error:

java.lang.IllegalArgumentException: contig must be non-null and not equal to *, and start must be >= 1
        at org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter.setPosition(SAMRecordToGATKReadAdapter.java:89)
        at org.broadinstitute.hellbender.utils.clipping.ClippingOp.applyHARDCLIP_BASES(ClippingOp.java:381)
        at org.broadinstitute.hellbender.utils.clipping.ClippingOp.apply(ClippingOp.java:73)
        at org.broadinstitute.hellbender.utils.clipping.ReadClipper.clipRead(ReadClipper.java:147)
        at org.broadinstitute.hellbender.utils.clipping.ReadClipper.clipRead(ReadClipper.java:128)
        at org.broadinstitute.hellbender.utils.clipping.ReadClipper.hardClipSoftClippedBases(ReadClipper.java:332)
        at org.broadinstitute.hellbender.utils.clipping.ReadClipper.hardClipSoftClippedBases(ReadClipper.java:335)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.finalizeRegion(AssemblyBasedCallerUtils.java:84)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.assembleReads(AssemblyBasedCallerUtils.java:238)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:480)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:221)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:244)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:217)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)

The bed file:

Chr_04  0       268
Chr_04  2648    4654
Chr_04  4964    98377
Chr_04  102202  169423
Chr_04  169694  180201

I also attached an image showing the location in the BAM file that causes the error.

Could you please fix this error. Thank you.

Post edited by shlee on

Best Answer

  • WimSWimS ✭✭
    Accepted Answer

    I rather not use the -dontUseSoftClippedBases option, I understand from another question it might decrease sensitivity for large indels. Having the NNN stretches and the ends of chromosomes sounds like a good idea. However most references we use are public and of varying quality. We don't have much influence on the creation of these references and would like to stick to the public coordinate system.

    For now I think I'll blacklist the entire Chr_04:0-268 from being variant called, by specifying a blacklist bed file for the reference in bcbio. It also looks like a collapsed repeat. This will hopefully solve the issue.

    Thank you.

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @WimS,

    java.lang.IllegalArgumentException: contig must be non-null and not equal to *, and start must be >= 1

    I've been told our tools interpret 0-based BED files appropriately, so it's unlikely that this is the cause of your error. We can test for this possibility by using a 1-based intervals list with the command. You can convert your BED file to the 1-based intervals that GATK traditionally expects with Picard BedToIntervalList.

    It appears to me in the IGV screenshot you share, that the reads at top may align beyond the end of reference. Is this the case? Such alignments would have alignment starts that are less than one. If the beyond-end-of-reference portion of alignments are soft-clipped, our tools should handle these reads without error. Otherwise, you will have to fix the alignment in these reads. For this you can use Picard CleanSam.

  • WimSWimS Member ✭✭
    edited July 2017

    Hi @slee . I think the crash is caused because of the soft clipped reads. If I interpret the IGV popup correctly the alignment of the read starts at Chr_04:1 for 119 bases match. But the first 6 bases, before the 119 bp match are soft clipped.

    The GATK4 applyHARDCLIP_BASES calls SAMRecordToGATKReadAdapter.setPosition to update the alignment start position to 1-6= -5 which of course does not make sense. I guess either applyHARDCLIP_BASES or SAMRecordToGATKReadAdapter.setPosition needs a small bit of extra code to prevent this <1 starting position update.

    This probably never happens in test data on the human reference genome because the reference genome is of much higher quality than the resequencing data, then if would never happen that you get a soft clipped read at position 1 of a chromosome.

    I can check tomorrow in the BAM file what the actual BAM records are.

    Previously I have variant called these bam files with Freebayes without any problems. The bam files are created using bwa-mem and a standard re-sequencing workflow in bcbio.

  • WimSWimS Member ✭✭
    edited July 2017

    This could be related to be a known and open bug in GATK4:
    https://github.com/broadinstitute/gatk/issues/3013

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    In this case, @WimS, perhaps HaplotypeCaller's -dontUseSoftClippedBases option could be helpful. I am not all that certain how the tool will interpret the alignments given this option, but it may be worth a try so that you can move forward with calling variants at these edges. I'd like to mention that in the human reference assemblies, the edges are marked by a series of Ns bases and so we do not encounter such issues as yours. It may be worth proposing to your model organism reference consortium that they do something similar for their next major release.

  • WimSWimS Member ✭✭
    Accepted Answer

    I rather not use the -dontUseSoftClippedBases option, I understand from another question it might decrease sensitivity for large indels. Having the NNN stretches and the ends of chromosomes sounds like a good idea. However most references we use are public and of varying quality. We don't have much influence on the creation of these references and would like to stick to the public coordinate system.

    For now I think I'll blacklist the entire Chr_04:0-268 from being variant called, by specifying a blacklist bed file for the reference in bcbio. It also looks like a collapsed repeat. This will hopefully solve the issue.

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @WimS
    Hi,

    Can you confirm your input BAM file validates with Picard's ValidateSamFile?

    Thanks,
    Sheila

  • WimSWimS Member ✭✭

    Hi @Sheila . I needed to specify IGNORE=[MISMATCH_MATE_CIGAR_STRING, MATE_CIGAR_STRING_INVALID_PRESENCE] in order to validate the BAM file with picard.

    Otherwise I got a lot of these errors:

    ERROR: Record 62, Read name DE18INS658:81:C718RANXX:3:2307:12883:19959, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
    ERROR: Record 62, Read name DE18INS658:81:C718RANXX:3:2307:12883:19959, Mate CIGAR string does not match CIGAR string of mate
    

    I am not sure these CIGAR issues are the cause of the GATK4 Haplotyper crash, since this cigar issues occur genome wide and the crash only on a specific region of the reference genome.

    These bam files are produced by an older version of bcbio (bcbio-0.9.9) which used bwa + older version of samblaster + bamcat to efficiently create the bam files. The invalid tags are probably caused by the old version of samblaster or bamcat.

    As mentioned in the answer that I accepted blacklisting the region from variant calling fixed the issue for me. Using the latest version of bcbio to run GATK4 also fixed the issue for me, it used a better approach to marking reads in collapsed repeat regions as QC failed. This means that those reads don't even enter GATK4.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @WimS,

    Picard FixMateInformation can sync mate-pair information including the MC tag.

    For less ideal reference sets, consistently using an intervals list to limit your analyses to regions of interest and to exclude problematic regions is the general approach folks take. With the latest human GRCh38 reference, intervals files for these purposes have not been necessary.

    Are you satisfied with the solution you are using or do you need us to look more into your crashes?

  • WimSWimS Member ✭✭

    Hi @shlee For me this issue can be closed. Black listing the whole region or just marking the collapsed repeat region reads QC failed for me fixes the issue. Those 2 strategies probably also work for all other similar cases.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @WimS
    Hi,

    I see. So, the new versions of the tools fixed the issue. That is good to hear.

    -Sheila

Sign In or Register to comment.