How to create fasta file of sequences using VCF informations?

Hi GATK team,

I would like to retrieve the variants sequences for some genes of interest. Basically I have a BAM file of reads aligned to a reference genome, that I have used to create a VCF file. I realized that, for a given position, the BAM file has more reads than the number of reads counted in the VCF file for this same given position. Probably du to filtering step that is taking into account only reads with good quality etc...
Is there a way to select reads that had been taken into account in the VCF file? Where can I see the reads that participate to the variant and the one that had been filtered out?
Is there a ways to reconstruct fasta file with sequence of the variant by using the VCF file? I have a GFF file in hand too if it can help.
I am kind of lost right now and I can't find this informations anywhere on the web.
Thanks a lot for your help.
Nathalie.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Nathalie,

    There are several ways to approach this problem. The most accurate is to have HaplotypeCaller emit an -outbam file as described here. But this adds some time to the processing.

    Another way is to use PrintReads with the same read filters as used by HC as described in the tool doc. This has the advantage that you can do it as post-processing (without recalling your samples) but it doesn't account for any realignment that HC may have done to the reads.

Sign In or Register to comment.