We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

why data are missing in vcf file?

vipinrangavipinranga FinlandMember
edited September 2014 in Ask the GATK team

I am doing SNP analysis on a plant species. For SNP finding process another plant species has been taken (because phylogenetically both species are very close) whose reference genome is available on NCBI. The vcf generated is not showing main information, first few lines of output is like this.

speciesch01           1        .         A             .              .       .                .       GT      ./.
speciesch01           2        .         G             .              .       .                .       GT      ./.
speciesch01           3        .         A             .              .       .                .       GT      ./.
speciesch01           4        .         G             .              .       .                .       GT      ./.
speciesch01           5        .         G             .              .       .                .       GT      ./.
speciesch01           6        .         T             .              .        .                .       GT      ./.

I have used the commands as follows:

1) java -Xmx16g -jar GenomeAnalysisTK.jar -R Reference_genome.fasta -T UnifiedGenotyper -I species.realigned.bam -o species.realigned.snps.vcf -stand_call_conf 30 -stand_emit_conf 10 --output_mode EMIT_ALL_SITES

2) java -Xmx16g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I species.realigned.bam -R Reference_genome.fasta -knownSites species.realigned.snps.vcf -o species.realigned_data.grp

3) java -Xmx16g -jar GenomeAnalysisTK.jar -T PrintReads -I species.realigned.bam -R Reference_genome.fasta -BQSR species.realigned_data.grp -o species.realigned.recal.bam

4) java -Xmx16g -jar GenomeAnalysisTK.jar -R Reference_genome.fasta -T UnifiedGenotyper -I species.realigned.bam --dbsnp species.realigned.snps.vcf -o species.realigned.recal.snps.vcf -stand_call_conf 30 -stand_emit_conf 10 --output_mode EMIT_ALL_SITES

Among the UnifiedGenotyper commands i.e. 1 and 4, the dbsnp flag has been used only in 4th command because I don't have any available snp information about both of the plant species. I need a vcf file with all data for further analysis. Please let me know where I am making mistake and what are the possible solutions of this problem?

Post edited by Geraldine_VdAuwera on


  • vipinrangavipinranga FinlandMember

    The vcf format is not looking good in this post, I am attaching a png file of the same data. Have a look.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @vipinranga,

    This shows that the program was not able to make any calls at these positions because there was no data. There can be several reasons for this. The most frequent is that there are is indeed no read coverage there, either because it is exome data and those positions are not in the target range, or because it's at the edge of a contig/chromosome, which are often uncovered. Another possibility is that there is data but all the reads were filtered out.

    You should check 3 things:

    1. Was your data was pre-processed according to our Best Practices recommendations?
    2. Is there any read coverage data at those sites (using e.g. IGV)?
    3. Were calls made at other positions in the genome?

    More generally, your workflow is not ideal, because you are using unfiltered raw sites as known sites for BQSR, which may undermine the efficacy of the process. You should apply some filtering to that output before calling variants, and you should check the results of the recalibration by producing the recalibration plots.

    Also, we recommend using HaplotypeCaller rather than UnifiedGenotyper now. Finally, if you are working on plants, is your plant species diploid? This is important because the caller makes certain mathematical adjustments depending on the ploidy.

Sign In or Register to comment.