To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Generate genotype specific FASTA sequences from VCF file and reference sequence

gthomsongthomson AucklandMember

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

http://imgur.com/LRZHIcw

I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?

I have also posted this on the Biostars forum..

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I assume that by "genotype" here you actually mean sample. The genotype is just one of the properties that is emitted for each sample.

    Make sure you're trimming the alleles by using -env so that only the alleles that are in the selected sample remain.

  • gthomsongthomson AucklandMember

    Great thank you. I tried the -env flag for the 3 exemplar genotypes in the picture and got the sequences I wanted.

Sign In or Register to comment.