
Ambiguity in the reads at the same position

nathajoli (quebec, Member)

Hi everyone,

I am quite new to the field of metagenomic analysis, so please excuse me if I phrase my questions in a strange way!

I am working on a metagenomic dataset, and I am interested in adaptation in a specific species, the alga Bathycoccus prasinos. I first aligned my reads against my Bathycoccus reference genome, then created a VCF file using GATK UnifiedGenotyper.

Here is my command line:
java -jar ../apps/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R ../genome/Bathycoccus_genome_FINAL_RELEASE.fasta -T UnifiedGenotyper -glm BOTH -I ../BAMfile/ReadsConcat_Bathy_VerySensitiveLocal_bowtie_sorted_readsgroup.bam -o output_GATK_test

I have a few questions:

  1. I am not sure I understand the -dbsnp parameter. Is it a big deal that I did not use it in my command line? From what I understand, it is a database of commonly found SNPs that helps distinguish true SNPs from sequencing errors. Is it specific to each species? Will my VCF results change if I don't use it?

  2. I searched everywhere online for this information without success. What does the program do when the reads are ambiguous at a given position? Imagine that many reads align to the same position, and they show bases that differ both from the reference and from one another. What happens in this case? Is the SNP ignored? Does the caller choose the most abundant base? Does it rely on the quality scores?

Thanks a lot for your answers. I am quite stuck right now because I am not sure I can trust my VCF file.

Nathalie Joli

Best Answers


  • Sheila (Broad Institute; Member, Broadie)

    Hi Nathalie,

    1) It is not a big deal if you do not use the dbsnp parameter in your command. The dbsnp parameter simply fills in the rsID column of the VCF: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--dbsnp Having the dbSNP rsID in your VCF will not help distinguish true variation from false positives. I think you are thinking of VQSR (the step after UnifiedGenotyper), in which dbSNP is used as a resource set. For VQSR, it does not matter whether the dbSNP rsID is in the VCF. I think dbSNP is specific to each species, but @Geraldine_VdAuwera will have to confirm. In any case, it will not change the calls in your VCF if you do not use it.

    2) Can you post an IGV screenshot of the site you are referring to? It depends on many different factors, including how many different alleles there are and base quality.

    You should really consider using HaplotypeCaller instead of UnifiedGenotyper, as we no longer recommend UnifiedGenotyper.


  • nathajoli (quebec, Member)

    Hi Sheila,

    Thank you very much for your answer.

    I now understand what the -dbsnp parameter is for.

    I don't really have a specific case in mind; I was wondering what decision the program makes in each case. Do you know where I could find documentation on that? I would like to understand all the scenarios that may be encountered.

    One last question: why do you no longer recommend UnifiedGenotyper? I am working on a eukaryotic microalga, Bathycoccus prasinos.

    Thanks a lot.


  • nathajoli (quebec, Member)

    Thank you Sheila!
    Have a nice day :)
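
For question 2 above (what happens when reads disagree at a position), it can help to look at the raw evidence at a site before reasoning about what a caller did with it. The sketch below is only an illustration, not how UnifiedGenotyper works: a real caller uses a probabilistic genotype-likelihood model rather than raw counts. The function name `tally_alleles`, the example pileup, and the base-quality cutoff of 17 are all assumptions made for this example.

```python
from collections import Counter

def tally_alleles(pileup, ref_base, min_base_quality=17):
    """Count the bases observed at one site, ignoring low-quality base calls.

    pileup: list of (base, phred_quality) tuples, one per read covering the site.
    min_base_quality: illustrative cutoff (an assumption, not a recommendation).
    """
    # Keep only base calls whose quality meets the cutoff.
    counts = Counter(
        base.upper() for base, qual in pileup if qual >= min_base_quality
    )
    depth = sum(counts.values())
    # Everything that differs from the reference is a candidate alternate allele.
    alt_counts = {b: n for b, n in counts.items() if b != ref_base.upper()}
    return counts, depth, alt_counts

# Hypothetical site: reference base A, with reads supporting A, G, T and C.
pileup = [("A", 35), ("A", 30), ("G", 32), ("G", 31), ("G", 8), ("T", 28), ("C", 5)]
counts, depth, alt_counts = tally_alleles(pileup, ref_base="A")

# Report each alternate allele with its share of the usable depth.
for base, n in sorted(alt_counts.items(), key=lambda kv: -kv[1]):
    print(f"{base}: {n}/{depth} reads ({n / depth:.0%})")
```

In this made-up example, the low-quality G and C calls drop out, leaving two candidate alternate alleles (G and T) next to the reference base. This is the kind of site where, as noted in the answer above, the outcome depends on how many alleles are present and on the base qualities.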
