We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

GATK IndelRealigner removing too many reads

by0by0 UKMember


I'm quite new to SNP calling. I am trying to setup a pipeline which includes GATK IndelRealigner as a final step. My bam file (before realignment) is a little over 1GB. After running the indel realigner however, it's reduced to 18MB! I'm assuming its throwing out way too many reads or something has gone wrong.

I'm calling the indel realigner with the default options as follows:

java -Xmx16g -jar $GATK_DIR/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R /path/to/my/ref \
   -I input.bam.intervals \
   -targetIntervals input.bam.intervals \
   -o realn.bam \

I am generating the read groups using AddOrReplaceReadGroups.jar (from picard tools) and interval file using GATK RealignerTargetCreator with default options.

My bam file was generated off the raw reads of experiment SRA181417 fetched from SRA (after cleaning adapters using cutadapt, mapping to reference using bwa-mem, and removing duplicate reads using picard tools)

I have tried this on other reads and do not have the same issue.
Can anyone comment on why indel realigner could be throwing out so many reads.

Thank you


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can you post the full console log output?

  • by0by0 UKMember

    Ah. I just realised there was an error in the output, with the error message being: We encountered a non-standard non-IUPAC base in the provided reference: '10'
    I would think SRA reads shouldn't contain non-IUPAC bases.
    Any idea?


    INFO  14:41:02,980 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
    INFO  14:41:02,981 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  14:41:02,981 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  14:41:02,985 HelpFormatter - Program Args: -T IndelRealigner -R /reference/ref.fa -I ./aln3_readgroups.bam -targetIntervals ./aln3_readgroups.bam.intervals -o ./aln4_realn.bam 
    INFO  14:41:02,990 HelpFormatter - Executing on Linux 2.6.32-504.1.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_76-b13. 
    INFO  14:41:02,990 HelpFormatter - Date/Time: 2015/05/19 14:41:02 
    INFO  14:41:02,990 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:41:02,991 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:41:03,540 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  14:41:03,596 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
    INFO  14:41:03,604 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  14:41:03,617 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
    INFO  14:41:03,688 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
    INFO  14:41:03,693 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  14:41:03,693 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  14:41:03,694 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
    INFO  14:41:03,842 ReadShardBalancer$1 - Loading BAM index data 
    INFO  14:41:03,843 ReadShardBalancer$1 - Done loading BAM index data 
    INFO  14:41:18,190 GATKRunReport - Uploaded run statistics report to AWS S3 
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
    ##### ERROR ------------------------------------------------------------------------------------------
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, this is a classic. Your reads are fine, but your reference file is malformed. It was probably stored/edited on a Windows machine at some point, and a Windows-specific character was introduced (the '10' that the error refers to is Windows encoding for new line iirc). On a Unix platform this is then interpreted incorrectly and so the tool fails. Just fix the file encoding or re-download an original copy if the reference and you'll be fine.

  • by0by0 UKMember
    edited May 2015

    Thank you for your reply Geraldine.

    That's odd. My reference / all indexing was done on Linux (Red Hat). My chromosome fasta file comes from this file (at the end)

    Do you see anything wrong with it?

    Here is how I'm indexing.

    # Path to reference fasta (extracted from gff)
    # Generate bwa index files for use with samtools
    bwa index -a bwtsw $FA_PATH 
    # Generate .fai file for use with gatk 
    samtools faidx $FA_PATH
    # Create reference .dict for GATK
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can't look at the file right now (on smartphone) but FYI we've also seen this happen when the sequence is all on one line or when there is an empty line in the sequence. Do a search (search box, top right corner of page) for the error message to see how people have dealt with this in the past.

Sign In or Register to comment.