bam is not indexed or bad input

Depending on the GATK input file, I get two different error messages when I run GATK.

If I use the output of Picard MarkDuplicates, I get an error about an unindexed BAM file, even though the BAM file was already indexed when Picard SortSam generated it, before I invoked MarkDuplicates. The .bai file exists, too.

And if I use the output of Picard SortSam directly, I get:
ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

java -jar SortSam.jar SO=coordinate INPUT=~/NGS/data/SRR062641.filt.sam OUTPUT=~/NGS/data/SRR062641.filt.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

~/NGS/pgm/GenomeAnalysisTK-2.4-9-g532efad$ java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/carolw/NGS/hg19/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o ~/NGS/data/SRR062641.filt.bam.list -I ~/NGS/data/SRR062641.filt.bam

ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

java -jar MarkDuplicates.jar INPUT=~/NGS/data/SRR062641.filt.bam OUTPUT=~/NGS/data/SRR062641.filt.marked.bam METRICS_FILE=metrics VALIDATION_STRINGENCY=LENIENT

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/carolw/NGS/hg19/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o ~/NGS/data/SRR062641.filt.bam.list -I ~/NGS/data/SRR062641.filt.marked.bam

ERROR MESSAGE: Invalid command line: Cannot process the provided BAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAMs in --unsafe mode, but this GATK feature is currently unsupported.

Best Answers


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I think your first issue is because Picard doesn't produce an index for the file it creates with the marked duplicates. You need to index that file explicitly. No need to index at the previous step.

    Once you do that, you're going to run into the second issue again. We've seen that happen with folks who work on Windows or whose reference was stored on a Windows server. Windows inserts line-break characters that break the file loading.

  • Carol (Member)

    But in both cases, CREATE_INDEX=true was set, so Picard should have indexed the file. Is this option not enough?
    Note that I work under Linux.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    In the MarkDuplicates command you posted, create_index is not included.

    If you did include it in the command you actually ran, then Picard should have created the index. Check that the filename of the index matches the BAM file exactly (apart from the .bai part). If you renamed the BAM file at all, you also have to rename the index file.
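The filename check described here can be scripted. What follows is only a minimal sketch, not part of GATK or Picard: `find_bam_index` is a hypothetical helper name, and it assumes the index sits in the same directory as the BAM and is named either `<name>.bam.bai` or `<name>.bai`.

```shell
# Hypothetical helper: print the index file found next to a BAM, if any.
# Assumption: the index is named <name>.bam.bai or <name>.bai and lives
# in the same directory as the BAM file.
find_bam_index() {
    bam="$1"
    for candidate in "${bam}.bai" "${bam%.bam}.bai"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Nothing matched: report on stderr and signal failure to the caller.
    echo "no index found for $bam" >&2
    return 1
}
```

For the file in this thread, the call would be `find_bam_index ~/NGS/data/SRR062641.filt.marked.bam`. A non-zero exit status means no index with a matching name was found, which is the situation where adding CREATE_INDEX=true to the MarkDuplicates command (as in the SortSam command posted earlier) or renaming the index would apply.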

  • Carol (Member)

    Yes, but to be sure, I ran it again after invoking Picard MarkDuplicates with CREATE_INDEX=true (it was already in the SortSam command). I got the same error message:
    ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

    Here is the full output:
    INFO 19:11:51,549 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:11:51,553 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
    INFO 19:11:51,553 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 19:11:51,553 HelpFormatter - For support and documentation go to
    INFO 19:11:51,559 HelpFormatter - Program Args: -T RealignerTargetCreator -R /home/carolw/NGS/hg19/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o /home/carolw/NGS/data/SRR062641.filt.bam.list -I /home/carolw/NGS/data/SRR062641.filt.marked.bam
    INFO 19:11:51,559 HelpFormatter - Date/Time: 2013/05/09 19:11:51
    INFO 19:11:51,559 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:11:51,559 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:11:51,691 GenomeAnalysisEngine - Strictness is SILENT
    INFO 19:11:51,911 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 19:11:51,916 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 19:11:51,953 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
    INFO 19:11:52,096 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
    INFO 19:11:52,482 GenomeAnalysisEngine - Done creating shard strategy
    INFO 19:11:52,483 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 19:12:22,486 ProgressMeter - chr1:173899709 1.74e+08 30.0 s 0.0 s 5.6% 8.9 m 8.4 m
    INFO 19:12:52,487 ProgressMeter - chr2:105725917 3.55e+08 60.0 s 0.0 s 11.5% 8.7 m 7.7 m
    INFO 19:13:22,488 ProgressMeter - chr3:38617269 5.31e+08 90.0 s 0.0 s 17.2% 8.7 m 7.2 m
    INFO 19:13:52,489 ProgressMeter - chr4:20909585 7.11e+08 120.0 s 0.0 s 23.0% 8.7 m 6.7 m
    INFO 19:14:22,491 ProgressMeter - chr5:14434221 8.96e+08 2.5 m 0.0 s 28.9% 8.6 m 6.1 m
    INFO 19:14:52,492 ProgressMeter - chr6:16942473 1.08e+09 3.0 m 0.0 s 34.9% 8.6 m 5.6 m
    INFO 19:15:22,492 ProgressMeter - chr7:32205961 1.27e+09 3.5 m 0.0 s 40.9% 8.6 m 5.1 m
    INFO 19:15:52,493 ProgressMeter - chr8:65355709 1.46e+09 4.0 m 0.0 s 47.1% 8.5 m 4.5 m
    INFO 19:16:22,494 ProgressMeter - chr9:111544989 1.65e+09 4.5 m 0.0 s 53.3% 8.4 m 3.9 m
    INFO 19:16:52,495 ProgressMeter - chr11:29817097 1.85e+09 5.0 m 0.0 s 59.6% 8.4 m 3.4 m
    INFO 19:17:22,496 ProgressMeter - chr12:87073909 2.04e+09 5.5 m 0.0 s 65.8% 8.4 m 2.9 m
    INFO 19:17:52,497 ProgressMeter - chr14:30241797 2.23e+09 6.0 m 0.0 s 72.0% 8.3 m 2.3 m
    INFO 19:18:22,498 ProgressMeter - chr16:13391029 2.42e+09 6.5 m 0.0 s 78.3% 8.3 m 108.0 s
    INFO 19:18:52,514 ProgressMeter - chr18:35420253 2.62e+09 7.0 m 0.0 s 84.5% 8.3 m 76.0 s
    INFO 19:19:22,515 ProgressMeter - chr21:27683077 2.81e+09 7.5 m 0.0 s 90.7% 8.3 m 45.0 s
    INFO 19:19:52,516 ProgressMeter - chrX:122471829 3.00e+09 8.0 m 0.0 s 97.0% 8.2 m 14.0 s
    INFO 19:20:09,161 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.4-9-g532efad):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
    ERROR ------------------------------------------------------------------------------------------

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Yes, this means that your index is now recognized and the GATK is attempting to proceed with the analysis. However, it is finding this bad character in the sequence. As I said in my first reply, that is a problem we have seen with people who use GATK on Windows or who store the reference file on a Windows server. You should re-download the reference from the original source or from our resource bundle.
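Before re-downloading, it can help to look at what is actually in the reference. The sketch below is an illustration only, using standard Unix tools: `count_cr_lines` and `list_non_iupac` are hypothetical helper names, and the set of accepted letters is assumed to be the IUPAC nucleotide codes in upper and lower case, which may not match exactly what GATK accepts.

```shell
# Hypothetical helper: count lines that contain a carriage return
# (the extra byte found in Windows-style line endings).
count_cr_lines() {
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so "|| true" keeps the function from failing.
    LC_ALL=C grep -c "$(printf '\r')" "$1" || true
}

# Hypothetical helper: list the distinct bytes in the sequence lines that
# are not IUPAC nucleotide letters, one decimal byte value per line.
list_non_iupac() {
    LC_ALL=C grep -v '^>' "$1" |                        # drop FASTA header lines
        tr -d 'ACGTURYSWKMBDHVNacgturyswkmbdhvn\n' |    # drop allowed letters and newlines
        od -An -tu1 |                                   # remaining bytes as decimal values
        tr -s ' ' '\n' |                                # one value per line
        grep -v '^$' |                                  # drop empty lines
        sort -un                                        # distinct values, numeric order
}
```

On the reference from this thread, the calls would be `count_cr_lines genome.fa` and `list_non_iupac genome.fa`. An empty result from `list_non_iupac` together with a count of 0 suggests the sequence lines are clean under the assumptions above; a value of 13 in the output is a carriage return.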

  • Carol (Member)

    I downloaded the reference file from the Cufflinks website, and since it's a tar.gz, it must be a Unix file and not a Windows one, no?

  • Carol (Member)

    I downloaded the ref file from your bundle and it works. Many thanks!

    Now I have applied GATK for local realignment around indels, and I would like to know whether I can ignore the reads that were filtered out by DuplicateReadFilter, MappingQualityZeroFilter and UnmappedReadFilter:

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ~/NGS/hg19/ucsc.hg19.fasta -o ~/NGS/data/SRR062641.bam.list -I ~/NGS/data/SRR062641.filt.marked.bam

    INFO 14:11:00,816 MicroScheduler - 4890 reads were filtered out during traversal out of 96739 total (5.05%)
    INFO 14:11:00,816 MicroScheduler - -> 351 reads (0.36% of total) failing DuplicateReadFilter
    INFO 14:11:00,816 MicroScheduler - -> 4538 reads (4.69% of total) failing MappingQualityZeroFilter
    INFO 14:11:00,825 MicroScheduler - -> 1 reads (0.00% of total) failing UnmappedReadFilter
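The figures in this summary are internally consistent: the three per-filter counts (351 + 4538 + 1) add up to the 4890 filtered reads. As a minimal sketch for recomputing the percentages from the counts, here is a hypothetical `percent_of` helper built on awk; it is not a GATK tool.

```shell
# Hypothetical helper: print a read count as a percentage of the total,
# rounded to two decimals as in the MicroScheduler summary.
percent_of() {
    awk -v n="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "total must be non-zero" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", 100 * n / total
    }'
}
```

For example, `percent_of 4538 96739` recomputes the share of reads failing MappingQualityZeroFilter out of the 96739 total reads reported in the log.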
