Reading bgzipped BCF files

johnwallace123 Member
edited April 2015 in Ask the GATK team

GATK Team,

I have recently started to look into using bgzipped BCF files as our primary means of input/output to GATK in order to save time parsing the VCF files. Unfortunately, due to space limitations, unzipped BCF files are not an option, as it looks like they're ~8x the size of a bgzipped VCF.

When I ran a simple "round trip" to convert vcf.gz -> bcf.gz -> vcf.gz (using SelectVariants) just to test the potential processing gains, I got the following error on the bcf.gz->vcf.gz leg:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version nightly-2015-04-30-gdd4ddcb): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Tabix indexed files only work with ASCII codecs, but received non-Ascii codec BCF2Codec, for input source: myFile.bcf.gz
##### ERROR ------------------------------------------------------------------------------------------

This issue persists with the nightly build as well.

Is the native reading of bcf.gz files something that is on the horizon for the GATK team, or is it still a long way off? It looks like this code is pretty deep in the htsjdk library, and fixing it may require a change to the class hierarchy.

Thanks,

John Wallace
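The error message above indicates that the tabix-indexed reading path expects a text (ASCII) codec, whereas a bcf.gz file holds binary BCF2 data inside the compressed container. As a rough illustration of that difference, the following Java sketch peeks at the first decompressed bytes of a gzip-compatible stream and reports whether they look like BCF2 (magic bytes "BCF" followed by major version 2) or text VCF (a leading "##" header line). The class name `VariantFormatSniffer` and its helper methods are hypothetical and are not part of GATK or htsjdk; the demo uses plain gzip from the JDK as a stand-in for bgzip, on the assumption that BGZF output is readable as an ordinary gzip stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not a GATK or htsjdk class.
final class VariantFormatSniffer {

    /** Returns "BCF", "VCF", or "UNKNOWN" for a gzip-compressed stream. */
    static String sniff(InputStream compressed) {
        byte[] head = new byte[5];
        int filled = 0;
        try (InputStream in = new GZIPInputStream(compressed)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // BCF2 files start with the bytes 'B','C','F', then major version 2, then a minor version.
        if (filled == 5 && head[0] == 'B' && head[1] == 'C' && head[2] == 'F' && head[3] == 2) {
            return "BCF";
        }
        // Text VCF files start with a "##fileformat=..." meta line.
        if (filled >= 2 && head[0] == '#' && head[1] == '#') {
            return "VCF";
        }
        return "UNKNOWN";
    }

    /** Compresses bytes with plain gzip (stand-in for bgzip in this demo). */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bcf = gzip(new byte[] {'B', 'C', 'F', 2, 2, 0, 0, 0, 0});
        byte[] vcf = gzip("##fileformat=VCFv4.2\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sniff(new ByteArrayInputStream(bcf))); // prints BCF
        System.out.println(sniff(new ByteArrayInputStream(vcf))); // prints VCF
    }
}
```

A check like this only identifies the payload type; it does not make a binary payload readable through a text-oriented index.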

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Update -- our software engineers told me that your (perfectly reasonable) assumption about saving parsing time is unlikely to hold: BCF performance is actually worse than VCF, because of how it's implemented in htsjdk. You may want to do some profile runs of BCF vs VCF vs VCF.gz on your data to see how runtime compares before putting in any development effort.

  • johnwallace123 Member

    @Geraldine_VdAuwera

    Thanks for the update. I'm surprised that BCF performance would be worse, but the only performance testing I've done has been vcf->bcf and bcf->vcf (using SelectVariants, in a probably silly way), and I found that bcf->vcf was about 30% faster than the reverse (7.06 min vs. 10.8 min). I've also had projects where converting "1/1" to [1, 1] took a large enough share of the runtime that it was worthwhile to write a custom "atoi" function that relied on some assumptions (<65K ALT alleles and nonnegative ints). That improved processing runtime by about 35% in that case.

    Of course, I'm explicitly testing a scenario where the processing of the variants into memory SHOULD be the limiting factor and it's only one test case. It's not clear to me how much overhead there is in reading files in a "typical" analysis step, and that probably varies by step. This is directly related to what you said in that there may be very little gain in going down this path, but it sure would be nice to have.
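The specialized genotype parsing mentioned in the reply above can be illustrated with a short Java sketch. This is not the poster's actual function, which is not shown in the thread; the class name `FastGenotypeParser`, the `MISSING` sentinel, and the handling of "." are assumptions made for illustration. It follows the stated assumptions (nonnegative allele indices well below 65K, so no overflow or sign handling) and skips input validation, which is where a general-purpose integer parser spends part of its time.

```java
import java.util.Arrays;

// Hypothetical sketch; assumes well-formed GT strings such as "1/1", "0|1", or "./.".
final class FastGenotypeParser {

    /** Sentinel for a missing allele ("."); an assumption of this sketch. */
    static final int MISSING = -1;

    /** Converts a GT string into allele indices, e.g. "1/1" becomes [1, 1]. */
    static int[] parse(CharSequence gt) {
        // Number of alleles is the number of separators plus one.
        int count = 1;
        for (int i = 0; i < gt.length(); i++) {
            char c = gt.charAt(i);
            if (c == '/' || c == '|') {
                count++;
            }
        }

        int[] alleles = new int[count];
        int idx = 0;
        int value = 0;
        boolean missing = false;
        for (int i = 0; i < gt.length(); i++) {
            char c = gt.charAt(i);
            if (c == '/' || c == '|') {
                // Separator: store the allele accumulated so far and reset.
                alleles[idx++] = missing ? MISSING : value;
                value = 0;
                missing = false;
            } else if (c == '.') {
                missing = true;
            } else {
                // Assumed to be a digit: no sign, whitespace, or overflow checks.
                value = value * 10 + (c - '0');
            }
        }
        alleles[idx] = missing ? MISSING : value;
        return alleles;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("1/1"))); // prints [1, 1]
        System.out.println(Arrays.toString(parse("0|1"))); // prints [0, 1]
    }
}
```

Whether a shortcut like this pays off depends on how much of a given step's runtime is spent parsing genotypes, so it is worth profiling before adopting it.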