dbSNP variant IDs

Will_Gilks (University of Sussex, UK; Member)

Hi Team,

I have two VCFs from D. melanogaster. The first contains in-house variant identifiers; the second contains NCBI dbSNP variant identifiers. Only the first file contains the genotype data. There are 4 million variants in the first file and 5 million in the second, and I expect substantial overlap between the two in terms of which variants are present. How can I replace the in-house variant identifier with the dbSNP identifier for each corresponding variant?

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Will_Gilks
    Hi,

    I don't think there is any GATK tool to do that. However, you might be able to use VCFtools. Can you tell me what your end goal is? Do you want to use GenotypeConcordance?

    Thanks,
    Sheila

  • Will_Gilks (University of Sussex, UK; Member)

    Hi @Sheila

    Well, really I have two problems in one. My first problem is lifting over a VCF to a more recent reference. I've solved this using a combination of GATK (LiftOverVariants and FilterLiftedVariants), vcfsorter.pl, bash, and reading the file into Python to remove odd formatting.

    The genotype data has previously been submitted to NCBI dbSNP, so all variants have dbSNP IDs. However, the VCF that I am lifting over has the in-house variant IDs, not the dbSNP IDs.

    The only VCF which has the dbSNP IDs does not have the actual genotype calls.

    Thus my end goal is to replace the in-house variant IDs in the original VCF with those from the dbSNP VCF (and then do the liftover). I'm going to work on a Perl script for this, but it's not ideal. I can't find a suitable tool in VCFtools. It's not a GenotypeConcordance goal.

    Ideally, once a VCF is uploaded to dbSNP, it would be returned to the user with both dbSNP IDs AND the genotype calls. A list of dbSNP IDs and positions isn't immediately useful. Having said this, I haven't yet submitted genotype data to dbSNP myself, so I don't know exactly what is returned. Finally, I should really be asking dbSNP this and not you, but thanks anyway :smile: I shall go away and read NCBI's user manuals.

    Thanks all the same,

    Will
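
A minimal Python sketch of the ID replacement Will describes, offered as one possible approach rather than a tested tool. It assumes both VCFs use the same reference build, the same contig names, and the same REF/ALT representation (including ALT order at multi-allelic sites), so that a variant can be matched on (CHROM, POS, REF, ALT). The function names and file paths are hypothetical. The dbSNP IDs are held in an in-memory dictionary, which should be workable for a few million records but does use a fair amount of RAM.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def load_dbsnp_ids(dbsnp_vcf):
    """Map (CHROM, POS, REF, ALT) -> ID for every record in the dbSNP VCF."""
    ids = {}
    with open_text(dbsnp_vcf) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
            if var_id != ".":
                ids[(chrom, pos, ref, alt)] = var_id
    return ids


def replace_ids(genotype_vcf, dbsnp_vcf, out_vcf):
    """Copy genotype_vcf to out_vcf, swapping in dbSNP IDs where a match exists.

    Records without a match keep their in-house ID.
    Returns (number replaced, number left unmatched).
    """
    ids = load_dbsnp_ids(dbsnp_vcf)
    replaced = unmatched = 0
    with open_text(genotype_vcf) as src, open(out_vcf, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # headers pass through unchanged
                continue
            fields = line.rstrip("\n").split("\t")
            key = (fields[0], fields[1], fields[3], fields[4])
            if key in ids:
                fields[2] = ids[key]  # column 3 of a VCF record is ID
                replaced += 1
            else:
                unmatched += 1
            dst.write("\t".join(fields) + "\n")
    return replaced, unmatched
```

Calling `replace_ids("genotypes.vcf", "dbsnp.vcf", "genotypes.rsids.vcf")` writes the relabelled file and returns the two counts, which gives a quick check on how much of the expected overlap was actually matched.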

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Actually, you can do this fairly easily with VariantAnnotator. Pass in the file with the dbSNP IDs using the --dbsnp argument. VariantAnnotator should overwrite the IDs in your input VCF with the rsIDs.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Will_Gilks @Geraldine_VdAuwera
    Hi Will and Geraldine,

    I just tested Geraldine's suggestion, and it looks like VariantAnnotator does not overwrite the original ID field; instead, it appends the dbSNP rsID to it. For example, if a variant's ID is Sheila1 in your original VCF, your final output VCF would have Sheila1;rs12345.

    I hope this helps.

    -Sheila
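
Given the behavior Sheila reports, one option is to post-process the VariantAnnotator output and drop the in-house part of the ID field. The following is a rough Python sketch of that step, not a GATK feature. It assumes the dbSNP identifiers have the form "rs" followed by digits and that in-house IDs never do. The function names are hypothetical.

```python
import re

# Assumption: dbSNP IDs look like rs12345; anything else is treated as in-house.
RS_ID = re.compile(r"^rs\d+$")


def keep_rs_ids(id_field):
    """Return only the rs-style IDs from a semicolon-separated VCF ID field.

    If the field contains no rs ID, it is returned unchanged so that
    unmatched variants keep their in-house identifier.
    """
    rs_ids = [part for part in id_field.split(";") if RS_ID.match(part)]
    return ";".join(rs_ids) if rs_ids else id_field


def strip_inhouse_ids(lines):
    """Yield VCF lines with the ID column reduced to rs IDs where present."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        fields[2] = keep_rs_ids(fields[2])
        yield "\t".join(fields) + "\n"
```

With the example from the post above, `keep_rs_ids("Sheila1;rs12345")` returns `"rs12345"`, while an unannotated ID such as `"Sheila2"` is left as it is.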
