
ERROR MESSAGE: Bad input:We encountered a non-standard non-IUPAC base in the provided reference: '13'

Hi,
I'm currently working with bwa, samtools and GATK to do SNP calling on Medicago truncatula. I'm using my own reference sequence, with the 8 chromosomes in the same fasta file.

C1_lenght=155648

AAAGATAGAGA..

C2_lenght=125018

ATGGATC...
etc..
The alignments ran without problems. To prepare for GATK I run rmdup --> CreateSequenceDictionary.jar (picard) --> samtools sort --> Read Group (picard) --> samtools index, and then create the realignment targets with:

java -jar -Xmx4g /usr/local/bioinfo/src/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -nt 8 -T RealignerTargetCreator -R REF.fa -o RTC.intervals -I INPUT_muq30_RMDUP_RG.bam

This step runs without problems, but when I run the realignment:

java -jar -Xmx4g /usr/local/bioinfo/src/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner -R REF.fa -I INPUT_muq30_RMDUP_RG.bam -targetIntervals RTC.intervals -o INPUT_muq30_RMDUP_RG_REAL.bam

I get this error message:
ERROR MESSAGE: Bad input:We encountered a non-standard non-IUPAC base in the provided reference: '13'

I couldn't find any explanation for this error on Google.
Could you please help me ?!

vschilling

Answers

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    Have you seen the FAQ article about inputs?

    This could be an encoding issue, if you're sure that there are no non-IUPAC bases in your reference.

  • vschill | Member
    edited July 2013

    Hi,
    I have seen the FAQ article about inputs. I'm sure that there are no non-IUPAC bases in my reference. My last message did not display correctly; my reference fasta file looks like this:

    '>C1_lenght155648

    AAAGATAGAGAATCGCTAGCTC

    CGCTAGCTCGCATATAGAGATAG
    ......

    '>C2_lenght28648

    ATTTCGCTCCCGATAAGATACTC

    CGCTCGCGCTCGAAAGCTCGA
    ......

    Is that correctly written? Would you have some explanation for this error message?
    Thanks in advance,

    Vincent

  • Hi,
    thanks for the answer. I have opened my fasta file in gedit and saved it in UTF-8 format. I'm going to check it.

    Vschilling

  • KStamm | Member
    edited August 2013

    Same problem has just cropped up for me today. Running the "BaseRecalibrator" from git/gatk-04-18-g2fd787a/GenomeAnalysisTK.jar gave the error "ERROR MESSAGE: Bad input:We encountered a non-standard non-IUPAC base in the provided reference: '10'"

    It's strange because I have used this pipeline on previous samples successfully. In fact it's the same reference fasta I've used for alignment and other BaseRecalibration many times in the past. The task even appears to complete, running right past Chr10 and through ChrX then MT before reporting the error. Then no output is produced and I only noticed when downstream processes reported missing input files.

    If the error message had told me what the problem was, or which line or offset, or even what the non-IUPAC code was, I could search for it. It's not impossible for my reference fasta to have become corrupted somehow; in fact that's the only explanation I can come up with given it has worked in the past. Now I'm searching through chr10 manually looking for anything out of place but that's not fruitful.

    Now I'm looking for FASTA validator tools I could run on the reference; but it's very strange for bwa to have succeeded.

    ----- edit update:

    I found the GATK now has a tool called QCRef to actually QC a reference file. Running that on my suspect fasta yields the same result. It processes happily through all chromosomes, then at the end it reports

      ##### ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
    

    Now I'm at a loss. I guess my fasta has become corrupted and I have no way of knowing where in the file the error was found, where in the chromosome, what the error was or how it got there. I don't want to download a new one at the risk of any updates invalidating all previous work done before this occurred (all previous alignments and variant calls would have to be rerun on the updated reference). What other information can I get about this error?
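
The wish expressed above, an error that says where the offending character sits, can be approximated outside GATK. Below is a minimal Python sketch, not a GATK feature. It assumes the accepted alphabet is the IUPAC nucleotide codes in either case (GATK's own list may differ) and that the number in the error message is the decimal value of the offending byte, which is an inference from this thread (13 would be a carriage return, 10 a line feed, 0 a NUL). The names `find_bad_bytes`, `IUPAC_BYTES` and `IUPAC_SET` are invented for the illustration.

```python
# IUPAC nucleotide codes, both cases (assumed alphabet; GATK's own list may differ).
IUPAC_BYTES = b"ACGTURYSWKMBDHVNacgturyswkmbdhvn"
IUPAC_SET = frozenset(IUPAC_BYTES)


def find_bad_bytes(lines, limit=10):
    """Return up to `limit` (line, column, byte_value) tuples, all 1-based,
    for bytes on sequence lines that are not IUPAC codes.

    `lines` is any iterable of raw byte lines, e.g. a file opened in "rb" mode.
    """
    hits = []
    for line_no, raw in enumerate(lines, start=1):
        # Strip only the LF so that a stray CR stays visible as byte 13.
        line = raw[:-1] if raw.endswith(b"\n") else raw
        if line.startswith(b">"):
            continue  # header lines may hold arbitrary text
        if not line.translate(None, IUPAC_BYTES):
            continue  # fast path: every byte on this line is an IUPAC code
        for col, value in enumerate(line, start=1):
            if value not in IUPAC_SET:
                hits.append((line_no, col, value))
                if len(hits) >= limit:
                    return hits
    return hits


# Usage on a real file (the path is a placeholder):
# with open("REF.fa", "rb") as handle:
#     for line_no, col, value in find_bad_bytes(handle):
#         print(f"line {line_no}, column {col}: byte {value}")
```

If this reports nothing, the file content itself is probably clean, and the index files built from it become the next thing to check.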

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    Where did you originally get your reference from? If it's a standard one you needn't worry about re-downloading it, it won't mess up your previous work.

    If you want to troubleshoot this problem you can start by checking the encoding of your reference file.

  • KStamm | Member
    edited August 2013

    I use the GATK/bundle/1.2/human_g1k_v37.fasta which I suppose would be stable and redownloadable. I'll do that and md5sum to check for divergence.

    Line endings are \n only.

    I've been using this file for a long time. The last batch of samples ran with this same reference and GATK version in July. That's what is really concerning me. It's therefore either a random data corruption or perhaps an updated system library my administrator didn't tell me about. I have come to expect each new version of GATK to have some new file format expectations and errors but that's why I stick to aging versions until a showstopping bug forces me to upgrade. (that put me up to nightly gatk-04-18-g2fd787a). So I know this version of GATK/reference had worked in the past.

    It's worth a try to redownload and checksum. If it turns out to be the problem, then we've got some more serious problems.

    ----edit UPDATE

    I don't see bundle version 1.2 still up on the FTP, but bundle 2.6 has a file of name human_g1k_v37.fasta I will compare to my local copy. The MD5 does not match mine, but perhaps I have renamed a chromosome. Now running a diff to find what is changed.

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    OK -- I believe we have always used the same human_g1k_v37.fasta so the bundle copy should be exactly the same. Good luck!

  • KStamm | Member
    edited August 2013

    After checking the two reference fastas for any differences I think I've found the problem. The GATK bundle human_g1k_v37.fasta has a blank line (extra \n) between chrs MT and GL000207 (at line number 51594898). Some tool had complained about this invalid line (wish I remembered which tool!) so I removed the empty line.

    A fasta.index was rebuilt, but the fasta.fai was ignored and became outdated. My guess is that removing that single character left the offsets recorded in the .fai after that point off by one. So when the BaseRecalibrator task tries to fetch a small portion of the GL000207 chromosome, the bytes it reads span an otherwise valid single \n. Hence the error message about character '10'.

    I'm running this step on the clean version to check that this was truly the problem and to rule out other system changes.
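
The off-by-one effect described above can be reproduced on a toy file. The sketch below assumes the usual five-column .fai layout written by samtools faidx (name, length, offset, bases per line, bytes per line); the contig names and sequences are invented, and `byte_offset` and `fetch` are illustrative helpers rather than anything GATK exposes.

```python
def byte_offset(entry, pos):
    """Byte offset in the FASTA file of 0-based position `pos` of a contig.

    `entry` is one .fai row: (name, length, offset, linebases, linewidth).
    """
    _name, _length, offset, linebases, linewidth = entry
    return offset + (pos // linebases) * linewidth + pos % linebases


def fetch(data, entry, pos):
    """Return the raw byte value the index points at for position `pos`."""
    return data[byte_offset(entry, pos)]


# Toy reference with a blank line between the two records (invented data).
original = b">MT\nACGT\nAC\n\n>GL\nGGCC\nTT\n"
# Index row for GL as built from `original`: its sequence starts at byte 17.
gl_entry = ("GL", 6, 17, 4, 5)

# The same file after deleting the blank line by hand, index left untouched.
edited = original.replace(b"\n\n", b"\n")

print(fetch(original, gl_entry, 3))  # 67, the byte for "C"
print(fetch(edited, gl_entry, 3))    # 10, the byte for "\n"
```

With the stale index, the last base of the first line of GL is read one byte too far and lands on the line feed, which matches the '10' in the error message.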

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    Hmm, interesting. Thanks for reporting back with this info, it might be useful to other users. Please do confirm whether this does indeed solve the problem.

    FYI I think the Fasta format specification does allow blank lines as long as records are formatted correctly; it sounds like the tool that complained is simply not able to cope with them. If you ever find out which tool that was, it might be useful to tell the developers.

  • Yup everything seems to be working now.

    The problem was a fasta index being outdated after I manually edited the genome reference to remove a single empty line between chromosomes, causing a cascade of off-by-one errors. It wasn't noticed by most tools because it occurred only in the auxiliary GL* chromosomes. The tool that refused to operate across the empty newline was Ensembl's Variant Effect Predictor, a perl script that calculates the coding-sequence impacts of a VCF with respect to a reference FASTA.

    Therefore I introduced the problem after the previous batch of samples had finished their GATK processing and only now have new samples shown the error in the reference's index.

    The good news is tracking down the problem has resolved my fears of unknown software updates or random data corruption or even erroneous sample processing; everything should be okay now and work doesn't have to be redone.

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    Glad to hear it! And thank you for summarizing the issue & resolution.

  • Juls | Member

    Hi,

    I get a similar error when running the following command:

    java7 -jar /path2gatk/GenomeAnalysisTK-2.6-4-g3e5ff60//GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I file.bam -o output.vcf -nct 25 --minPruning 5 -dcov 200 -gt_mode DISCOVERY -out_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '0'
    ERROR ------------------------------------------------------------------------------------------

    The reference file has the correct encoding. I have exported the file from the CLC workbench - there shouldn't be any non-IUPAC chars in there. I've also run the local realignment without any trouble with this reference - this just showed up when running the HaplotypeCaller. And the error pops up randomly (not always at the same scaffold as shown by the progressmeter)

    I am lost here. I hope you can help me!
    Thank you

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    The randomness of the error is likely due to the fact that the error only occurs when the positions involved are actually used by the program, and certain processing steps don't necessarily act on all positions.

    I expect that the issue was introduced by CLC workbench. Is it a custom reference?

  • ersenkavak | Turkey | Member

    I had the same problem and reindexing the reference fasta file resolved the issue.
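
Several posts in this thread come down to an index that no longer matches the FASTA (a hand-edited file, a renamed header). A rough check is to recompute the index rows and compare them with the existing .fai. The Python sketch below assumes the five-column samtools faidx layout and uniform line lengths within each record; `build_fai`, `parse_fai` and `mismatched_contigs` are illustrative names, not part of samtools, Picard or GATK.

```python
def build_fai(lines):
    """Recompute (name, length, offset, linebases, linewidth) for each record.

    `lines` is an iterable of raw byte lines, e.g. a file opened in "rb" mode.
    """
    entries = []
    current = None
    pos = 0  # byte offset of the start of the current line
    for raw in lines:
        if raw.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            fields = raw[1:].split()
            name = fields[0].decode() if fields else ""
            # The sequence starts right after the header line.
            current = [name, 0, pos + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[3] == 0 and bases:
                # The first sequence line defines the line geometry.
                current[3], current[4] = bases, len(raw)
            current[1] += bases
        pos += len(raw)
    if current is not None:
        entries.append(tuple(current))
    return entries


def parse_fai(text):
    """Parse the first five columns of each non-empty .fai row."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            name, *numbers = line.split("\t")[:5]
            rows.append((name, *(int(n) for n in numbers)))
    return rows


def mismatched_contigs(fasta_lines, fai_text):
    """Names whose index row is missing or differs from the recomputed one."""
    expected = {row[0]: row for row in build_fai(fasta_lines)}
    indexed = {row[0]: row for row in parse_fai(fai_text)}
    names = expected.keys() | indexed.keys()
    return sorted(n for n in names if expected.get(n) != indexed.get(n))


# Usage on real files (paths are placeholders):
# with open("REF.fa", "rb") as fa, open("REF.fa.fai") as fai:
#     print(mismatched_contigs(fa, fai.read()))
```

Any contig it lists is a reason to rebuild the .fai and .dict files, which is the fix reported by several people here.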

  • jkominek | Belgium | Member

    I just had the same problem, so I want to thank everybody on the topic for suggestions. I wanted to add that sometimes reindexing alone isn't enough to make the error go away. You might also need to "normalize" your reference FASTA file (using "picard NormalizeFasta") before reindexing, even if your reference file looks perfectly ordinary.

  • lawal | United Kingdom | Member

    @All, just to contribute to the post. I got this error message

    Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

    While going through this page, I got an idea of where the problem might be coming from. I realized that after creating the .fa.fai, .dict and so on, I later changed the header of my fasta file to a name completely different from the one in the .fa.fai. This actually created the problem. So what I did was rename it to the one that matches the other files created earlier from the fasta file and guess what...BINGO!!!

  • fabiodp | Padova, Italy | Member

    Hi All,
    Is there any update about this issue?
    I had the same problem running HaplotypeCaller from GATK 3.4-46 on exome sequencing data with a standard version of hg37 as reference. The strange thing is that in my lab we always use this fasta file for many other studies and analyses with GATK tools and we never saw this error before. I checked it for blank lines or spaces but none were found. I also checked the dates of the fasta file and its indexes: all the indexes are newer than the reference and no modifications were made.
    Moreover, I ran HaplotypeCaller on the same dataset twice and it stopped at a different point each time, but with this same error:

    Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

    Thanks for any help.

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie
    @fabiodp did you try any of the solutions other people have proposed above?
  • fabiodp | Padova, Italy | Member

    @Geraldine_VdAuwera
    I searched the reference for blank spaces or empty lines but none were found. And the indexes were ok with the reference.
    Now the thing is strange because I have a couple of samples that I am analyzing with HaplotypeCaller. The error appeared when I ran the analysis for all the samples simultaneously. After that, I tried launching one sample after the other, each only once the previous analysis had completed: no error for any sample. This makes me think that there could be a problem when different HaplotypeCaller processes read the same file (the reference) at the same time... I know this might sound strange, but it is the only explanation that comes to mind. Is that possible?
    Thanks

  • Sheila | Broad Institute | Member, Broadie, Moderator

    @fabiodp
    Hi,

    Can you confirm this happens with the latest version?

    Thanks,
    Sheila

  • bbimber | Home | Member

    Hello,

    I am suddenly seeing this exact same issue using RealignerTargetCreator on v3.6-0-gf185a75. We've run this on previous versions w/ the exact same FASTA/FAI inputs (and even the same FASTQ input data) before. We have the exact same error on 2 different genomes (different FASTAs) in the last day. The exact error is "Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'". Based on these posts I'm going to check the FASTA for \r and double newlines. If there is anything else I should check please let me know. I'll remake the FAI after this if anything changes. Should I be doing anything else?

    Is this a known issue w/ 3.6, or did some parsing change?

  • jule_ilh | Germany | Member

    Hi,
    I used GATK's RealignerTargetCreator and got the same error message:

    ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '13'

    The reference sequence was 'created' with snapgene.

    Solution: I am working in a UNIX environment and I converted the reference file to have unix line breaks:
    mac2unix ref.fa
    And just to make sure, I used the following as well
    dos2unix ref.fa

    Then I created a new dictionary and index file and GATK's RealignerTargetCreator worked without any errors.
    So the problem seemed to be the line breaks.
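
To see which line-break style a reference uses before reaching for mac2unix or dos2unix, the bytes can simply be counted. This is a small Python sketch with invented function names; it works on data already held in memory, so it suits small references or a sample read from a large one.

```python
def count_line_endings(data):
    """Count Unix (LF), Windows (CRLF) and classic Mac (bare CR) line breaks."""
    crlf = data.count(b"\r\n")
    return {
        "LF": data.count(b"\n") - crlf,    # LF not preceded by CR
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,    # CR not followed by LF
    }


def to_unix(data):
    """Convert CRLF and bare CR line breaks to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b">ref\r\nACGT\r\nTTGA\n"
print(count_line_endings(sample))  # {'LF': 1, 'CRLF': 2, 'CR': 0}
print(to_unix(sample))             # b'>ref\nACGT\nTTGA\n'
```

As in the post above, any conversion changes byte offsets, so the dictionary and index need to be recreated afterwards.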

  • bbimber | Home | Member

    FWIW - that doesn't explain ours. Our FASTA only contains \n, and did not have carriage returns. It also did not have blank lines (which was suggested elsewhere as a possible issue).

  • Sheila | Broad Institute | Member, Broadie, Moderator

    @bbimber
    Hi,

    There was an issue with the nightly build system that should be fixed now. Please try again with the latest nightly build, and the issue should go away.

    -Sheila

  • bbimber | Home | Member

    I've been making the JAR by cloning from github master, 3.6 tag and building locally. Does this change that answer at all?

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie

    @bbimber Nothing has changed recently that would explain this, as far as I can tell. We recently took in a new version of htsjdk but it seems unlikely that that would be the cause -- anything like that would break our test suite, but all tests are passing normally. I'm also skeptical that compilation would cause this, but you can always test the precompiled version from our website and see if that makes any difference. Ultimately you should test with a fresh copy of the reference file to eliminate a file corruption issue.

  • bbimber | Home | Member

    I'll look at this; however, I'll reemphasize the FASTA file has not changed (and was pretty heavily used for various GATK analyses). The only change was an update to the GATK JAR.

  • Adam_m | Poland | Member

    So could someone paste a link to the 'correct for gatk' reference genome hg19?
    It seems that mine is wrong, however it worked well with other tools...

    Ahh, this software is amazing, I did everything as written in the documentation/tutorials and I get error after error :disappointed:
    Fortunately there's usually at least a link to an article with an explanation of the problem and a solution.

    I tried every one of the solutions above and none helped, so the problem is probably with the reference.

  • Sheila | Broad Institute | Member, Broadie, Moderator

    @Adam_m
    Hi,

    Can you please post the exact command you ran and the entire log output and error message you get? Which version of GATK are you using? The hg19 reference we provide can be found in our bundle. But, if you used a different version for mapping, you will have to re-do the entire process.

    -Sheila

  • Adam_m | Poland | Member

    Thank you for the answer.

    Message:
    ERROR MESSAGE: Bad input:We encountered a non-standard non-IUPAC base in the provided reference: '10'

    I just downloaded the latest version. Unfortunately I'm working on Win XP 32 bit. Moreover, I want to re-call variants; I already have bam, vcf, etc. files made by the NGS manufacturer's staff. To be honest I don't know what tools and commands were used to obtain the bam files etc.

    And one more question. When I made an index file (fai) from my reference genome (hg19), I first had a problem with HaplotypeCaller - 'wrong symbol has been found in fai file line 17'. I opened it in Notepad++ and discovered that there are minus symbols in the 2nd column of the fai file. I removed them manually and HaplotypeCaller runs after this modification, but then the error mentioned above occurred at ~70% progress. So it might be a problem with the encoding of the reference index.
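
A .fai file is plain tab-separated text, so impossible values such as the negative numbers described above can be caught before GATK reads the file. This is a hedged Python sketch, assuming the five-column samtools faidx layout (name, length, offset, bases per line, bytes per line); `check_fai` is a hypothetical helper, not an existing tool.

```python
def check_fai(text):
    """Return (line_number, reason) for every .fai row that looks invalid."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append((line_no, "expected 5 tab-separated columns"))
            continue
        try:
            length, offset, linebases, linewidth = (int(f) for f in fields[1:5])
        except ValueError:
            problems.append((line_no, "non-integer value"))
            continue
        # Lengths and line sizes must be positive; a line cannot be
        # shorter in bytes than the number of bases it holds.
        if length <= 0 or offset < 0 or linebases <= 0 or linewidth < linebases:
            problems.append((line_no, "value out of range"))
    return problems


# Usage on a real index (the path is a placeholder):
# with open("REF.fa.fai") as handle:
#     for line_no, reason in check_fai(handle.read()):
#         print(f"line {line_no}: {reason}")
```

Rows flagged this way are better fixed by rebuilding the index than by editing the numbers by hand, since a hand-edited length or offset will most likely still be wrong.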

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie
    I'm afraid we don't support working on Windows. I would recommend looking into a virtual environment or dual boot system to run in a Linux environment instead, otherwise you're going to run into this sort of problem a lot, and we won't be able to help you.
  • bbimber | Home | Member

    I believe my issue probably has a different root cause than the one reported by @Adam_m, but I thought I'd add that I narrowed down this issue to FASTA files present on our cluster's filesystem (which uses lustre). We never observe this for the identical FASTAs (and associated FAI, DICT files) when they are read from other mounted filesystems. We haven't had the bandwidth to investigate it much beyond that.

  • shlee | Cambridge | Member, Broadie, Moderator

    Thanks for the update @bbimber.

  • gulongjiang | China | Member

    I have encountered the same error message: "ERROR MESSAGE: Bad input:We encountered a non-standard non-IUPAC base in the provided reference: '10'".
    First, I used the dos2unix command to convert my genome file to unix file format, then used "picard CreateSequenceDictionary" to create a new dict file, and finally the "samtools faidx" command to re-index the genome file, and then GATK works!

  • shlee | Cambridge | Member, Broadie, Moderator

    Thanks for sharing your solution @gulongjiang!
