
Converting per-sample g.vcf files to FASTA

I'm working on haploid bacteria, and I would like to create a FASTA file of all positions for which I have data. I have looked at FastaAlternateReferenceMaker, but it seems to output the reference allele when the read depth at a position is 0. If that is the case, I would prefer N instead, since no coverage means we have no data at all for that position: it might be ref, or it might be something else, and I'd rather not assume.

I think it should be possible to create a FASTA file from the g.vcf file, since it is supposed to contain data on all positions, not just SNPs. Before I start working on a program, I would like to know:
1. Is there a better/existing way to do what I described above?
2. Am I correct in thinking that the g.vcf file is the best input file type for this?
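
To make the idea concrete, here is a rough sketch of the kind of program I have in mind. It is untested on real data and rests on several assumptions: a single-sample, haploid, uncompressed g.vcf; reference blocks marked with `<NON_REF>` and an `END=` INFO value; depth taken from `MIN_DP` when present, else `DP`. The function names and the `min_depth` parameter are my own placeholders. Since a reference block only records the first base of the block, the reference sequence is still needed as input. Indels and other non-SNP calls are simply left as N here.

```python
def parse_info(info):
    """Turn an INFO string such as 'END=120' into a dict."""
    out = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
        elif item and item != ".":
            out[item] = True
    return out


def gvcf_to_sequences(gvcf_lines, reference, min_depth=1):
    """Build one sequence per contig, with N wherever there is no usable data.

    gvcf_lines: iterable of text lines from a single-sample haploid g.vcf.
    reference:  dict mapping contig name -> reference sequence (str).
    """
    # Start from all-N so that anything not covered by a record stays N.
    seqs = {name: ["N"] * len(seq) for name, seq in reference.items()}

    for line in gvcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        chrom, pos, ref, alt = f[0], int(f[1]), f[3], f[4]
        info = parse_info(f[7])
        sample = dict(zip(f[8].split(":"), f[9].split(":")))

        depth_text = sample.get("MIN_DP", sample.get("DP", "0"))
        depth = int(depth_text) if depth_text.isdigit() else 0
        if depth < min_depth:
            continue  # no (or too little) coverage: leave as N

        gt = sample.get("GT", ".")
        if not gt.isdigit():
            continue  # no-call, or not a haploid genotype: leave as N

        start = pos - 1  # VCF positions are 1-based
        if gt == "0":
            # Reference call; a block spans POS..END, a single site just POS.
            end = int(info.get("END", pos))
            seqs[chrom][start:end] = reference[chrom][start:end]
        else:
            allele = ([ref] + alt.split(","))[int(gt)]
            # Only simple SNPs are written; indels and <NON_REF> stay N.
            if len(ref) == 1 and len(allele) == 1:
                seqs[chrom][start] = allele

    return {name: "".join(bases) for name, bases in seqs.items()}


def to_fasta(seqs, width=60):
    """Format the sequences as FASTA text, wrapped at `width` columns."""
    lines = []
    for name, seq in seqs.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Raising `min_depth` above 1 would also mask positions that are covered by only a few reads, which may be a safer default than trusting a single read.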

I know there are ways to convert BAM to FASTA, but for me that has several drawbacks:
1. I would have to deal with counting coverage myself
2. I would have to create the reverse complement while reading the BAM file, and detect things like PCR duplicates and bad mappings
3. I won't have the benefit of GATK's on-the-fly realignment step, or any of the other statistical computations that go into generating g.vcf files
4. BAM files are big and g.vcf files are small, so using g.vcf files will be a lot faster

What do you think? I'd like to know whether my assumptions are correct before I start working on this.

