
error with GenotypeGVCFs

blueskypy Member
edited April 2014 in Ask the GATK team

The commands are:

java -Xmx10g -jar $gatk -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -o s1.raw.var.g.vcf.gz \
 -I s1.bam \
 -pairHMM VECTOR_LOGLESS_CACHING \
 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 &&

java -Xmx10g -jar $gatk -T GenotypeGVCFs  \
 -R $refGenome \
 --variant s1.raw.var.g.vcf.gz \
 -o s1.raw.var.vcf.gz

Error in GenotypeGVCFs :

ERROR MESSAGE: Line 115: there aren't enough columns for line �������uN����ޫrf�ebh�����NH(����ٻat� �E�;J1����X��ʩá���Ç�����>��@��,0��' (we

expected 9 tokens, and saw 2 ), for input source: s1.raw.var.g.vcf.gz


Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hey @blueskypy‌, can you give a little more context? Is this happening only for one file or all of them that you try to run on? And which version are you running with? (I'm sure you've told me before but I don't remember from one question to the next)

  • blueskypy Member
    edited April 2014

    Hi Geraldine,
    I just came back from France! I ran just two samples and both had similar errors. I haven't tried more samples yet. It's v3.1.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Welcome back, hope the French treated you well.

    It looks like we're not reading gzipped files correctly -- I know there's some fixing going on internally in our team and in Picard (since we use a lot of their code for I/O operations), let me check the status and get back to you. In the meantime you can check the latest nightly and see if that works properly -- I saw a gzip/tabix fix go in recently, that may just solve your issue.

  • ebanks Broad Institute Member, Broadie, Dev

    Or you can re-index the file directly with Tabix for now

  • I checked GenomeAnalysisTK-nightly-2014-04-02-g4016f99.tar.bz2, but it didn't solve the problem.

    Yes, I'll use tabix for now. Just one question: will GATK automatically recognize x.gz.tbi as the index file for x.gz, or do I have to rename x.gz.tbi to x.gz.idx?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Right, it's an indexing problem, so the new version wouldn't fix the existing bad indices; it would create good indices in future so you don't have to re-index things manually.

    Yes, GATK should recognize the .tbi indices.
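
The manual re-indexing workaround discussed in this part of the thread can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name `reindex_gvcf` is hypothetical, the input is assumed to be bgzip-compressed already, and `tabix` from htslib is assumed to be on the PATH.

```shell
# Hypothetical helper: rebuild the tabix index for a bgzip-compressed GVCF.
reindex_gvcf() {
    gvcf="$1"
    if [ ! -f "$gvcf" ]; then
        echo "no such file: $gvcf" >&2
        return 1
    fi
    # -f overwrites an existing (possibly bad) index; -p vcf selects the VCF preset.
    tabix -f -p vcf "$gvcf" || return 1
    # tabix writes the index next to the data file as <file>.tbi.
    [ -f "$gvcf.tbi" ]
}
```

For example, `reindex_gvcf s1.raw.var.g.vcf.gz` would leave `s1.raw.var.g.vcf.gz.tbi` beside the data file, which is the naming the reply above says GATK should recognize.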

  • mehar Member

    Dear Geraldine,

    I have a similar error with GenotypeGVCFs using GATK v3.3-0-g37228af. Below is the command used:

    java -Xmx10G -jar GATK-3.3/GenomeAnalysisTK.jar -R canFam3.fa -T GenotypeGVCFs --variant GATK.g.vcf -o Gatk.vcf

    I used the g.vcf file generated by GATK in the previous step and ended up with this error. I have run over 100 samples through the same pipeline, and 4 of them hit this error.

    Could you help? Thanks.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @mehar please post the exact error text you are getting.

  • auraf Member
    edited November 2017

    Hi, I am similarly using GenotypeGVCFs on some g.vcf files (not compressed), and I am obtaining a similar error:

    ERROR MESSAGE: Line 3599039: there aren't enough columns for line �T&2Qo���128��ߟ� ��0��ߟ���ߟ���ߟ���ߟ���ߟ�d�<.����H�ߟ�sA������ߟ� �ߟ� �ߟ��E.����ߟ���ߟ� �ߟ��e����� �ߟ� �ߟ� �ߟ� �ߟ�*�ߟ��������� �ߟ�������������@�Q�� �ߟ� �ߟ�T�ߟ�x���00��ߟ���ߟ��������2��(������c5Y]/�)F��V���W-p�$172.20.0.8 �ߟ�P�ߟ���ߟ���)���� ��������N���+�����ߟ�Z����������ߟ�Z ���0 �������22Z�����4��0��ߟ���ߟ�R���0�ߟ���ߟ���ߟ��������ߟ��������ߟ��ߟ���ߟ�#ATSIGN DEFAULT="" OVERRIDE=\@ (we expected 9 tokens, and saw 4 ), for input source: group12.g.vcf

    Reading the comments above I tried to re-index the g.vcf file using tabix and these commands:

    bgzip -c group12.g.vcf > group12.g.vcf.gz
    tabix -fp vcf group12.g.vcf.gz

    However, I still get an error:
    [E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used?
    The offending line was: "Û"
    [E::hts_idx_push] unsorted positions on sequence #10: 115372190 followed by 1
    tbx_index_build failed: group12.g.vcf.gz

    The file group12.g.vcf was generated with the GATK pipeline (the last command run on it before this step was CombineGVCFs).

    Thank you in advance for any possible suggestion.
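
Because the error above comes from an uncompressed g.vcf, one way to locate damaged records before re-indexing is to scan the file for data lines with too few tab-separated columns. The following is a minimal sketch and not a GATK feature: the function name `find_short_vcf_lines` is hypothetical, and the threshold of 9 columns is taken from the "expected 9 tokens" wording in the error message. Some awk implementations handle NUL bytes poorly, so treat the output as a hint only.

```shell
# Hypothetical helper: list non-header lines with fewer than 9 tab-separated fields.
find_short_vcf_lines() {
    # LC_ALL=C makes awk treat the input as plain bytes, so binary debris
    # is not rejected as invalid multibyte text.
    LC_ALL=C awk -F '\t' '
        /^#/ { next }                                  # skip header lines
        NF < 9 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit (bad ? 1 : 0) }                     # non-zero status if anything was found
    ' "$1"
}
```

Running `find_short_vcf_lines group12.g.vcf` would print the line numbers of suspect records and return a non-zero status if any are found. If it reports lines, the file itself is probably damaged and re-compressing or re-indexing alone is unlikely to help.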

  • SkyWarrior Turkey Member
    edited November 2017

    Your bgzip command seems to be off. bgzip does not write to stdout.

    Can you bgzip your g.vcf file as shown below?

    bgzip group12.g.vcf

    tabix -fp vcf group12.g.vcf.gz

    Also it is better to upgrade your htslib if the version is too old.
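
Before indexing, it can help to confirm that the compressed file really is BGZF (the block-compressed gzip variant that tabix requires) and not plain gzip. The sketch below checks the first four bytes, which for a BGZF file are 1f 8b 08 04 according to the SAM/BAM specification. The function name `looks_like_bgzf` is hypothetical, and the check is only a heuristic since it does not inspect the rest of the header. As far as I know, `bgzip -c` in htslib does write BGZF to standard output, so both command forms shown in this thread should pass this check; a file that passes but still fails to index may have damaged records in the uncompressed source.

```shell
# Hypothetical helper: succeed if the file starts with the BGZF magic bytes.
looks_like_bgzf() {
    # od prints the first 4 bytes as hex; tr strips the spaces and newline.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b0804" ]
}
```

A call such as `looks_like_bgzf group12.g.vcf.gz && tabix -f -p vcf group12.g.vcf.gz` would then only attempt indexing when the header looks right.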

  • shlee Cambridge Member, Broadie, Moderator

    Hi @auraf,

    This thread is dated and these types of bugs should be fixed by now for later versions of GATK. Can you tell us the version of GATK you are using and the exact command that generated the file causing the error? GATK tools, when writing a VCF, should generate a VCF index automatically.

  • Hi @shlee , sorry for the late response. I am using GATK v3.6, using the GenotypeGVCFs tool.

  • shlee Cambridge Member, Broadie, Moderator

    Hi @auraf,

    I know GATK4 is still in beta but I recommend IndexFeatureFile to index your VCF.

    You should also check to make sure your GVCF passes ValidateVariants.
