dbSNP variant IDs

Will_Gilks, University of Sussex, UK (Member)

Hi Team,

I have two VCFs from D. melanogaster. The first contains in-house variant identifiers; the second contains NCBI dbSNP variant identifiers. Only the first file contains the genotype data. There are 4 million variants in the first file and 5 million in the second, and the two are expected to overlap substantially in which variants are present. How can I replace the in-house variant identifier with the dbSNP identifier for each corresponding variant?
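
One way to approach this outside GATK is to match records on chromosome, position, REF and ALT and copy the ID across. The following is only a minimal Python sketch under stated assumptions: the function names and the example records are made up for illustration, both files are assumed to use the same reference coordinates and the same allele representation, and multi-allelic sites whose ALT alleles are ordered differently in the two files would not match.

```python
def variant_key(fields):
    """Key a VCF record by CHROM, POS, REF and ALT (columns 1, 2, 4, 5)."""
    return (fields[0], fields[1], fields[3], fields[4])


def load_dbsnp_ids(dbsnp_lines):
    """Map each variant key in the dbSNP VCF to its ID (column 3)."""
    ids = {}
    for line in dbsnp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ids[variant_key(fields)] = fields[2]
    return ids


def replace_ids(genotype_lines, dbsnp_ids):
    """Yield genotype VCF lines, swapping in the dbSNP ID where a match exists."""
    for line in genotype_lines:
        if line.startswith("#") or not line.strip():
            yield line  # header and blank lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        new_id = dbsnp_ids.get(variant_key(fields))
        if new_id is not None and new_id != ".":
            fields[2] = new_id  # unmatched records keep their in-house ID
        yield "\t".join(fields) + "\n"


# Made-up example records (hypothetical IDs and positions).
dbsnp = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "2L\t5000\trs0000001\tA\tG\t.\t.\t.\n",
]
genotypes = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tline1\n",
    "2L\t5000\tinhouse_1\tA\tG\t50\tPASS\t.\tGT\t1/1\n",
    "2L\t6000\tinhouse_2\tC\tT\t50\tPASS\t.\tGT\t0/1\n",
]
result = list(replace_ids(genotypes, load_dbsnp_ids(dbsnp)))
```

With real files, the two lists would be replaced by open file handles. The dbSNP lookup is held in memory as a dictionary, which should be manageable for a few million variants but has not been measured here.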

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Will_Gilks
    Hi,

    I don't think there is any GATK tool to do that; however, you might be able to use VCFtools. Can you tell me what your end goal is? Do you want to use GenotypeConcordance?

    Thanks,
    Sheila

  • Will_Gilks, University of Sussex, UK (Member)

    Hi @Sheila

    Well, really I have two problems in one. My first problem is lifting over a VCF to a more recent reference. I've solved this using a combination of GATK (LiftOverVariants and FilterLiftedVariants), vcfsorter.pl, bash, and reading the file into Python to remove odd formatting.

    The genotype data has previously been submitted to NCBI dbSNP, so all variants have dbSNP IDs. However, the VCF that I am lifting over has the in-house variant IDs, not the dbSNP IDs.

    The only VCF that has the dbSNP IDs does not have the actual genotype calls.

    Thus my end goal is to replace the in-house variant IDs in the original VCF with those from the dbSNP VCF (and then do the liftover). I'm going to work on a Perl script for this, but it's not ideal. I can't find a suitable tool in VCFtools. It's not a GenotypeConcordance goal.

    Ideally, once a VCF is uploaded to dbSNP, it would be returned to the user with both dbSNP IDs AND the genotype calls. A list of dbSNP IDs and positions isn't immediately useful. Having said this, I haven't yet submitted genotype data to dbSNP myself, so I don't know exactly what is returned. Finally, I should really be asking dbSNP this and not you, but thanks anyway. I shall go away and read NCBI's user manuals.

    Thanks all the same,

    Will

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Actually, you can do this fairly easily with VariantAnnotator. Pass in the file with the dbSNP IDs using the --dbsnp argument. VA should overwrite the IDs in your input VCF with the rsIDs.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @Will_Gilks @Geraldine_VdAuwera
    Hi Will and Geraldine,

    I just tested Geraldine's suggestion, and it looks like VariantAnnotator does not overwrite the original ID field; instead, it appends the dbSNP rsID to the field. For example, if the ID in your original VCF is Sheila1, your final output VCF would have Sheila1;rs12345.

    I hope this helps.

    -Sheila
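
If only the rsID is wanted in the ID column, the combined value described above (for example Sheila1;rs12345) could be cleaned up in a post-processing step. This is a rough Python sketch rather than a GATK feature: the function names are made up, and it assumes that in-house IDs never start with "rs" and that the ID field is semicolon-separated as in the example.

```python
def keep_rs_ids(id_field):
    """Keep only the semicolon-separated parts of an ID field that start with 'rs'.

    If no part starts with 'rs', the field is returned unchanged so that
    variants without a dbSNP match keep their in-house ID.
    """
    rs_parts = [part for part in id_field.split(";") if part.startswith("rs")]
    return ";".join(rs_parts) if rs_parts else id_field


def strip_inhouse_ids(vcf_lines):
    """Yield VCF lines with the ID column (column 3) reduced to rsIDs where present."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            yield line  # header and blank lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        fields[2] = keep_rs_ids(fields[2])
        yield "\t".join(fields) + "\n"


# Made-up record using the ID from the example above.
annotated = ["2L\t5000\tSheila1;rs12345\tA\tG\t50\tPASS\t.\tGT\t1/1\n"]
cleaned = list(strip_inhouse_ids(annotated))
```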
