BeagleOutputToVCF Error "Unable to read index file, for input source: .r2"

dear GATK team,

i am trying to convert beagle output files back to vcf format.
but i always get the following error:

ERROR MESSAGE: Unable to read index file, for input source: /Chr25.r2.idx

i have tried it with GATK 2.6-4 and 2.8.1.
my code is as follows:

java -Xmx4g -jar $GATKDir2/GenomeAnalysisTK.jar \
-R $refDir/$reference \
-T BeagleOutputToVCF \
-l INFO \
-V $vcfDir/Chr25-Raw.vcf \
-beaglePhased:BEAGLE $beagleDir/Chr25.phased \
-beagleProbs:BEAGLE $beagleDir/Chr25.gprobs \
-beagleR2:BEAGLE $beagleDir/Chr25.r2 \
-o $beagleDir/Chr25.afterbeagle.vcf

i tried to index the .r2 file with igvtools

java -Xmx4g -jar $igvDir/igvtools.jar index $beagleDir/Chr25.r2
but that does not recognize the file type.
error message:
Unknown File Type

i also tried with an "empty" vcf like this:

##fileformat=VCFv4.1
##pseudoVCF to run beagleToVCF
#CHROM POS ID REF ALT QUAL FILTER INFO

any help highly appreciated!
best regards marlies

Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    This might be a bug, as I don't think that BEAGLE produces an index for the r2 file (someone correct me if I'm wrong), so GATK shouldn't be requiring one. Do you see an r2.idx file in Beagle's output?

  • MarDole Member

    hi geraldine,
    thanks for the quick reply! to my knowledge beagle does not produce .idx files for any of its output files. the 4 output files i got for each chromosome are .dose, .gprobs, .phased, and .r2.
    thanks
    marlies

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hmm, that's what I thought (I don't work with Beagle myself so I have no first-hand experience). Can you post the full console output from your run of BeagleOutputToVCF?

  • ebanks (Broad Institute) Member, Broadie, Dev ✭✭✭✭

    I just want to point out that the latest version of Beagle works directly with VCFs now, so you shouldn't need to use these tools anymore...

  • MarDole Member

    hi geraldine and eric,

    as far as i understood beagle v4 can now produce vcf formatted output files. but i already have the beagle v3 output files produced by collaborators on a couple of hundred cattle whole genome resequencing data
    that i need to use, as i do not have the computational resources to redo this task.
    thanks again!
    marlies
    when i run BeagleOutputToVCF the first time the console output reads as follows.

    INFO 16:32:46,766 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:32:46,781 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
    INFO 16:32:46,781 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 16:32:46,781 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:32:46,787 HelpFormatter - Program Args: -R /home/marlies/cifs/GATK/UMD_3.1.fa -T BeagleOutputToVCF -l INFO -V /home/marlies/rumba/bulls1k/Chr25-Raw.vcf -beaglePhased:BEAGLE /home/marlies/rumba/bulls1k/Chr25.phased -beagleProbs:BEAGLE /home/marlies/rumba/bulls1k/Chr25.gprobs -beagleR2:BEAGLE /home/marlies/rumba/bulls1k/Chr25.r2 -o /home/marlies/rumba/bulls1k/Chr25.afterbeagle.vcf
    INFO 16:32:46,796 HelpFormatter - Date/Time: 2014/01/08 16:32:46
    INFO 16:32:46,796 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:32:46,796 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:32:46,839 ArgumentTypeDescriptor - Dynamically determined type of /home/marlies/rumba/bulls1k/Chr25-Raw.vcf to be VCF
    INFO 16:32:47,752 GenomeAnalysisEngine - Strictness is SILENT
    INFO 16:32:47,983 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 16:32:48,021 RMDTrackBuilder - Loading Tribble index from disk for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf
    INFO 16:32:48,249 RMDTrackBuilder - Writing Tribble index to disk for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx
    WARN 16:32:48,255 RMDTrackBuilder - Unable to update index with the sequence dictionary for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx; this will not affect your run of the GATK
    INFO 16:32:48,300 RMDTrackBuilder - Creating Tribble index in memory for file /home/marlies/rumba/bulls1k/Chr25.r2
    INFO 16:32:50,152 RMDTrackBuilder - Writing Tribble index to disk for file /home/marlies/rumba/bulls1k/Chr25.r2.idx
    INFO 16:32:52,383 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: I/O error loading or writing tribble index file for /home/marlies/rumba/bulls1k/Chr25.r2
    ERROR ------------------------------------------------------------------------------------------

    when i run BeagleOutputToVCF a second time the console output reads as follows.

    INFO 16:46:21,602 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:46:21,613 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
    INFO 16:46:21,613 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 16:46:21,613 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:46:21,629 HelpFormatter - Program Args: -R /home/marlies/cifs/GATK/UMD_3.1.fa -T BeagleOutputToVCF -l INFO -V /home/marlies/rumba/bulls1k/Chr25-Raw.vcf -beaglePhased:BEAGLE /home/marlies/rumba/bulls1k/Chr25.phased -beagleProbs:BEAGLE /home/marlies/rumba/bulls1k/Chr25.gprobs -beagleR2:BEAGLE /home/marlies/rumba/bulls1k/Chr25.r2 -o /home/marlies/rumba/bulls1k/Chr25.afterbeagle.vcf
    INFO 16:46:21,633 HelpFormatter - Date/Time: 2014/01/08 16:46:21
    INFO 16:46:21,633 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:46:21,633 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:46:21,697 ArgumentTypeDescriptor - Dynamically determined type of /home/marlies/rumba/bulls1k/Chr25-Raw.vcf to be VCF
    INFO 16:46:22,595 GenomeAnalysisEngine - Strictness is SILENT
    INFO 16:46:22,877 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 16:46:22,901 RMDTrackBuilder - Loading Tribble index from disk for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf
    INFO 16:46:23,135 RMDTrackBuilder - Writing Tribble index to disk for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx
    WARN 16:46:23,141 RMDTrackBuilder - Unable to update index with the sequence dictionary for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx; this will not affect your run of the GATK
    INFO 16:46:23,205 RMDTrackBuilder - Loading Tribble index from disk for file /home/marlies/rumba/bulls1k/Chr25.r2
    INFO 16:46:25,767 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.8-1-g932cd3a):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Unable to read index file, for input source: /home/marlies/rumba/bulls1k/Chr25.r2.idx
    ERROR ------------------------------------------------------------------------------------------

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Ah, that makes more sense now. The program is creating an index for your r2 file at the beginning of the run, but for some reason the file couldn't be written successfully in your first run. Then in your second run, the program sees there is an index file, but it is incomplete or corrupted and can't be read.

    What is still not clear is why the file creation failed. Based on the earlier warning:

    Unable to update index with the sequence dictionary for file /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx

    I'm wondering if maybe there is a permissions error that is preventing GATK from writing any files to disk. Or your disk is full (though I would expect a more specific error to that effect), or your space quota is used up. You'll want to check the status of your permissions and space in the directory you're using. Are you on a server or a personal machine?
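
The failure mode described above (a first run leaves an incomplete index, and later runs fail when trying to read it) can be checked from the shell. The following is a minimal sketch, not part of GATK: `index_status` and `remove_empty_index` are hypothetical helper names, it assumes the index sits next to the input file with an `.idx` suffix as in the log above, and it treats a zero-byte file as one sign of a failed write, which is an assumption here.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GATK).

# Classify the state of the index that sits next to an input file.
# Assumption: the index is named "<input>.idx", as in the log above.
index_status() {
    idx="$1.idx"
    if [ ! -e "$idx" ]; then
        echo "missing"
    elif [ ! -s "$idx" ]; then
        echo "empty"        # file exists but has size 0
    elif [ ! -r "$idx" ]; then
        echo "unreadable"   # current user lacks read permission
    else
        echo "ok"
    fi
}

# Remove a zero-byte index so that it no longer blocks the next run.
# Indexes in any other state are left untouched.
remove_empty_index() {
    if [ "$(index_status "$1")" = "empty" ]; then
        rm -f -- "$1.idx"
    fi
}
```

For example, `index_status "$beagleDir/Chr25.r2"` reports the state of `Chr25.r2.idx`, and `df -h "$beagleDir"` shows the free space on that filesystem. Removing an empty index only clears the leftover file; it does not address whatever prevented the write in the first place.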

  • MarDole Member

    hi geraldine,

    i am running my scripts from my local machine, but the data are on a micro-server. i have ~8TB of free space on that server. and i have so far not experienced any problems with many other GATK walkers working in that exact set-up.

    i can index the -V file without troubles with igvtools

    echo "indexing with igvtools"

    java -Xmx4g -jar $igvDir/igvtools.jar index $vcfDir/Chr25-Raw.vcf

    echo " size of index file"

    ls -lh $vcfDir/Chr25-Raw.vcf.idx

    #####result:

    indexing with igvtools

    Done

    size of index file

    -rwxrwxrwx 1 root root 22K Jan 8 18:40 /home/marlies/rumba/bulls1k/Chr25-Raw.vcf.idx

    note the -V vcf file Chr25-Raw.vcf was produced by samtools, so i had a closer look at the header lines and realized that the contig and chromosomal information is lacking in the vcfs produced by samtools.

    so to rule out any incompatibility because of samtools i took different data for which we called SNPs with unified genotyper, produced beagle input and ran beagle. with this dataset indeed the problem associated with the sequence dictionary is gone. GATK takes the vcf and vcf.idx specified in the -V option. but the problem with indexing .r2 unfortunately remains. it creates the .r2.idx but of size 0, and then throws the errors as above.

    i also ran the script as root but experienced the same problem.

    thanks!

    marlies
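
The header check mentioned in this post can be done with grep. This is a minimal sketch, assuming contig information appears as `##contig` header lines, which is how the VCF specification defines them; `count_contig_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the "##contig" header lines in a VCF file.
# A count of 0 means the header carries no contig information.
count_contig_lines() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function from reporting a failure.
    grep -c '^##contig' -- "$1" || true
}
```

For example, `count_contig_lines "$vcfDir/Chr25-Raw.vcf"` prints the number of contig header lines in the file passed to -V.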

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hmm, that's odd. Could you please upload some snippet files that reproduce the error to our FTP so we can debug this locally? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

  • noa (Boston area) Member

    Hello,
    I am receiving 2 similar consecutive errors while running the CombineVariants tool:
    On the first run: "Couldn't write file C:....test1.vcf because unable to write Tribble index with exception"
    and on the second run: "Unable to read index file, for input source: C:...test1.vcf.idx", since the idx was created as an empty file.
    Due to technical constraints I am using a GATK-lite version, which I know is not up to date. Is there a solution to this problem that still uses the lite version?
    Thanks!
    Noa

  • MarDole Member

    hi geraldine,
    i never got back to you about this. very sorry.
    now that i see someone else has a similar problem, i want to share what i figured out.
    i installed a lot of different older versions of GATK, until one of them gave a more meaningful answer.
    the problem was that although all folders were declared read write execute (777), the GATK java machine creates the .idx as read write execute by the owner only.
    if you are not that owner, the following process that actually then wants to write the info into the .idx file fails.

    hope this helps.
    marlies
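
The situation described in this post, an index file that only its owner can access, can be spotted from the shell. This is a minimal sketch, not part of GATK: `mode_string` and `owner_only` are hypothetical helper names, and the check relies on the usual `ls -l` layout in which the first ten characters are the file type and permission bits.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GATK).

# Print the 10-character type/permission string of a file, e.g. "-rwx------".
mode_string() {
    ls -ld -- "$1" | cut -c1-10
}

# Succeed when neither the group nor others have any access to the file,
# i.e. the last six permission characters are all "-".
owner_only() {
    case "$(mode_string "$1")" in
        ????------) return 0 ;;
        *)          return 1 ;;
    esac
}
```

For example, if `owner_only "$beagleDir/Chr25.r2.idx"` succeeds, comparing the owner shown by `ls -l "$beagleDir/Chr25.r2.idx"` with the output of `id -un` tells you whether the user running GATK is the one who can write to the file.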

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @MarDole Thanks for sharing that information, I would not have guessed that.

  • MarDole Member

    i run GATK in a virtual machine on shared hard drives. so that might be an added complication. but it clearly was a problem of system file permissions.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Yep, that makes sense. Unfortunately that's not something we can provide help with, but it's really useful for people to know that's what they should be looking into.
