Using Picardtools LiftoverVCF between species

lstbl, CU Boulder (Member)

I have a VCF file produced by mapping reads from chimpanzee onto the human genome. I want to take this VCF and do a liftover to the chimpanzee genome using picardtools LiftoverVcf. However, when I do this, everything gets dumped in the 'rejected' folder and labeled "MismatchedRefAllele".

I understand why this would be a good feature when you are mapping to the same species, but is there any way to change this behavior in the program? I tried setting the "LIFTOVER_MIN_MATCH" parameter to 0, but that doesn't seem to do anything.

One possibility is to switch the "REF" and "ALT" bases in the original VCF and ignore all positions that have an allele frequency of 100% (AF=1.000), i.e. fixed differences between human and chimpanzee. However, I'd prefer not to mess with the file too much myself if possible.
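
For illustration only, the swap described above could be sketched roughly as follows in Python. This is a minimal sketch, not a Picard or GATK feature: the function names `parse_info` and `swap_ref_alt` are hypothetical, records are assumed to be tab-separated with INFO in the eighth column, and only biallelic SNPs are handled. Note that it leaves AC/AF and any genotype columns untouched, so those would still describe the old ALT allele.

```python
BASES = {"A", "C", "G", "T"}


def parse_info(info):
    """Turn 'AC=15;AF=0.441;AN=34' into a dict; flag-style keys map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def swap_ref_alt(line):
    """Return the record with REF and ALT swapped, or None if it is skipped."""
    fields = line.rstrip("\n").split("\t")
    ref, alt = fields[3].upper(), fields[4].upper()
    # Only handle biallelic SNPs; indels and multi-allelic sites are skipped.
    if ref not in BASES or alt not in BASES:
        return None
    af = parse_info(fields[7]).get("AF")
    # Skip records with a missing or flag-style AF.
    if not isinstance(af, str):
        return None
    try:
        af_value = float(af)
    except ValueError:
        return None
    # Skip fixed differences (AF = 1.000).
    if af_value >= 1.0:
        return None
    # NOTE: AC/AF and genotype columns are NOT recomputed in this sketch.
    fields[3], fields[4] = alt, ref
    return "\t".join(fields)


record = "chr1\t74125\t.\tG\tA\t5453.96\tPASS\tAC=15;AF=0.441;AN=34"
print(swap_ref_alt(record))
```

Whether a blanket swap is actually appropriate for every polymorphic site is a separate question; the sketch only shows the mechanics.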

I also tried LiftoverVariants from an earlier version of GATK (2.9), but it doesn't seem to support lifting over between species.

Any thoughts would be much appreciated!

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    These tools are simply not designed for that use case. You would be better off realigning the original reads to the chimpanzee genome. I'm not very familiar with chimpanzee genomics, but what you're trying to do sounds unnecessarily complicated. Can you tell us more about what you're trying to achieve?
  • lstbl, CU Boulder (Member)

    Hi Geraldine, thank you for your reply.

    I'm using data from a study which mapped non-human short reads to a human reference. This is why the .vcf files have a human reference, not a chimpanzee reference.

    I agree this is sort of a strange thing to do, and I also agree that remapping the reads myself is the way to go (which I have done). The reason I want to lift over these human-mapped variants to the other species' reference is that I was thinking they would make a good reference set to build the VQSR model with. Of course, I have the SNPs from dbSNP; however, for some species in this study there is no SNP information on NCBI. So, for my purposes, it would be great to lift these human-mapped variants over to that species' genome and use them as input to VQSR.

    Hope that makes sense.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @lstbl, thanks for clarifying. I don't think this set of variants would be particularly helpful for VQSR, for the following reason. If I understand correctly how it was produced, that set of variants describes how chimpanzees are different from humans. For doing VQSR on chimp data aligned to the chimp reference, you need a resource set of variants cataloguing variation among chimps. Anything else will produce suboptimal results if it runs at all.

  • lstbl, CU Boulder (Member)
    edited June 2016

    Hi Geraldine,

    This is true; however, I believe it could still give good SNP information. For example, take this entry from the beginning of the .vcf file (partially truncated for clarity):

    CHROM POS ID REF ALT QUAL FILTER INFO
    chr1 74125 . G A 5453.96 PASS AC=15;AF=0.441;AN=34

    Given that you could lift this over, I believe you could still say this is a known SNP in the chimpanzee genome, no?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Sure, but that will only be the case at sites where the human and chimp references have identical sequences. I don't know enough about chimp genomes to say whether that will apply for enough of the sites in your dataset. But I expect you'll be missing out on a lot of the common variation that is specific to chimps -- so overall it's very unlikely that the dataset will be informative enough to be useable for VQSR. These tools require very large numbers of sites to work properly, and highly curated resource datasets. For humans we use genotyping chip results and several other resources that have been heavily validated. Frankly, I'm very skeptical that this dataset is going to come anywhere close to meeting that standard.
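
To make the point about identical reference sequences concrete, here is a simplified Python picture of the kind of check that would lead to a "MismatchedRefAllele" rejection. It is a sketch under stated assumptions, not Picard's actual implementation: the function name `ref_allele_matches` and the toy sequences are hypothetical, the position is assumed to be already converted to target coordinates, and strand flips (reverse-complemented chain blocks) are ignored.

```python
def ref_allele_matches(record_ref, target_sequence, pos):
    """Return True if REF equals the target reference at 1-based position pos."""
    start = pos - 1  # VCF positions are 1-based; Python slices are 0-based.
    target = target_sequence[start:start + len(record_ref)]
    return target.upper() == record_ref.upper()


# Toy example: the variant was called against a human reference with REF=G.
human_like = "ACGTAC"
chimp_same = "ACGTAC"   # same base at position 3, so the record could be kept
chimp_diff = "ACATAC"   # different base at position 3, so REF no longer matches

print(ref_allele_matches("G", chimp_same, 3))
print(ref_allele_matches("G", chimp_diff, 3))
```

Only records of the first kind, where the base at the lifted position is the same in both references, would keep a valid REF after liftover.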

  • lstbl, CU Boulder (Member)

    Ah, this makes sense. I suppose I was under the impression that the VQSR model could be built with fewer SNPs than what is required.

    However, if I look at all positions that pass the quality filter AND have more than one allele at that site (i.e. AF<1.000), there are still >28 million positions. Even if less than half of these get properly lifted over, I feel like it'd be lots of information for VQSR. (but maybe not?)

    Is there any literature out there that gives an idea of how many SNPs are necessary to adequately build the VQSR model? Sorry for my ignorance on the matter, but I wasn't able to find anything during the first iteration of literature searching.

    I much appreciate your insight!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    Accepted Answer

    The tricky thing is that there is no hard number for how many sites you need; it depends on how noisy your data is (the less noisy, the fewer sites you need) and how much overlap you have between your sample callset and the resources. This is because VQSR won't use all the SNPs in your resource; it can only use those that are also present in your callset. The more samples you have, the higher that number will be, of course. You can get it to work on fewer sites by requesting fewer annotations (=fewer dimensions in the model) and fewer max clusters. But you need to be very confident in your input resources.

    You can definitely try it, don't let me discourage you :) But it's important to have realistic expectations, especially if it's going to take a lot of effort to whip that resource dataset into a useable shape.
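
One rough way to gauge the overlap mentioned in this answer is to count the sites shared between the callset and the candidate resource before running VQSR. The Python sketch below assumes plain tab-separated VCF lines and treats two records as the same site when chromosome, position, REF and ALT all match; that matching rule and the function names `site_keys` and `usable_training_sites` are assumptions made for illustration, not a description of how VQSR itself matches sites.

```python
def site_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for each data line, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], int(fields[1]), fields[3], fields[4]))
    return keys


def usable_training_sites(callset_lines, resource_lines):
    """Return the sites present in both the callset and the resource."""
    return site_keys(callset_lines) & site_keys(resource_lines)


# Toy data: three callset sites, two resource sites, one shared.
callset = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr2\t300\t.\tG\tA",
]
resource = [
    "chr1\t200\t.\tC\tT",
    "chr3\t400\t.\tT\tC",
]

shared = usable_training_sites(callset, resource)
print(len(shared), "of", len(site_keys(resource)), "resource sites are in the callset")
```

If the shared count turns out to be a small fraction of the resource, that is an early sign the resource may not carry enough information for the model.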
