Using Picardtools LiftoverVCF between species

lstbl (CU Boulder; Member; Posts: 7)

I have a VCF file produced by mapping reads from chimpanzee onto the human genome. I want to take this VCF and do a liftover to the chimpanzee genome using picardtools LiftoverVcf. However, when I do this, everything gets dumped in the 'rejected' folder and labeled "MismatchedRefAllele".

I understand why this would be a good feature when you are mapping to the same species, but is there any way to change this behavior in the program? I tried setting the "LIFTOVER_MIN_MATCH" parameter to 0, but that doesn't seem to do anything.

One possibility is to switch the "REF" and "ALT" bases in the original VCF and ignore all positions that have an allele frequency of 100% (AF=1.000)--i.e. fixed differences between human and chimpanzee. However, I'd prefer not to mess with the file too much myself if possible.
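
As a rough illustration of that idea (a sketch only, not something Picard or GATK provides), the swap could look like the following in Python. It assumes sites-only, biallelic, tab-separated records whose INFO field carries single-valued AC, AF and AN tags; the function names are hypothetical, and the example record is the one quoted later in this thread.

```python
def parse_info(info):
    """Turn 'AC=15;AF=0.441;AN=34' into a dict of strings."""
    pairs = (item.split("=", 1) for item in info.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def swap_ref_alt(line):
    """Return the record with REF and ALT swapped, or None if it is skipped.

    Assumptions (not from Picard/GATK): biallelic site, AC/AF/AN present.
    Genotype columns and other INFO tags are dropped by this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    if "," in alt:
        return None  # multiallelic: out of scope here
    tags = parse_info(info)
    ac, an = int(tags["AC"]), int(tags["AN"])
    if ac == an:
        return None  # AF = 1.000, i.e. a fixed human/chimp difference
    new_ac = an - ac  # allele counts flip along with the alleles
    new_info = "AC={};AF={:.3f};AN={}".format(new_ac, new_ac / an, an)
    return "\t".join([chrom, pos, vid, alt, ref, qual, flt, new_info])


record = "chr1\t74125\t.\tG\tA\t5453.96\tPASS\tAC=15;AF=0.441;AN=34"
print(swap_ref_alt(record))
```

Note that swapping the alleles also means recomputing AC and AF, and any genotype columns would need recoding as well, which is part of why editing the file by hand is unappealing.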

I also tried LiftoverVariants from an earlier version of GATK (2.9), but it doesn't seem to support lifting over between species.

Any thoughts would be much appreciated!

Answers

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,117)
    These tools are simply not designed for that use case. You would be better off realigning the original reads to the chimpanzee genome. I'm not very familiar with chimpanzee genomics, but what you're trying to do sounds unnecessarily complicated. Can you tell us more about what you're trying to achieve?

    Geraldine Van der Auwera, PhD

  • lstbl (CU Boulder; Member; Posts: 7)

    Hi Geraldine, thank you for your reply.

    I'm using data from a study which mapped non-human short reads to a human reference. This is why the .vcf files have a human reference, not a chimpanzee reference.

    I agree this is sort of a strange thing to do, and I also agree that remapping the reads myself is the way to go (which I have done). The reason I want to lift these human-mapped variants over to the other species' reference is that I was thinking they would make a good resource set for building the VQSR model. Of course, I have the SNPs from dbSNP; however, for some species in this study there is no SNP information at NCBI. So, for my purposes, it'd be great to lift these human-mapped variants over to that species' genome and use them as input to VQSR.

    Hope that makes sense.

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,117)

    Hi @lstbl, thanks for clarifying. I don't think this set of variants would be particularly helpful for VQSR, for the following reason. If I understand correctly how it was produced, that set of variants describes how chimpanzees are different from humans. For doing VQSR on chimp data aligned to the chimp reference, you need a resource set of variants cataloguing variation among chimps. Anything else will produce suboptimal results if it runs at all.

    Geraldine Van der Auwera, PhD

  • lstbl (CU Boulder; Member; Posts: 7)
    edited June 2016

    Hi Geraldine,

    This is true; however, I believe it could still give good SNP information. For example, take this entry from the beginning of the .vcf file (partially truncated for clarity):

    CHROM  POS    ID  REF  ALT  QUAL     FILTER  INFO
    chr1   74125  .   G    A    5453.96  PASS    AC=15;AF=0.441;AN=34

    Given that you could lift this over, I believe you could still say this is a known SNP in the chimpanzee genome, no?

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,117)

    Sure, but that will only be the case at sites where the human and chimp references have identical sequences. I don't know enough about chimp genomes to say whether that will apply for enough of the sites in your dataset. But I expect you'll be missing out on a lot of the common variation that is specific to chimps -- so overall it's very unlikely that the dataset will be informative enough to be useable for VQSR. These tools require very large numbers of sites to work properly, and highly curated resource datasets. For humans we use genotyping chip results and several other resources that have been heavily validated. Frankly, I'm very skeptical that this dataset is going to come anywhere close to meeting that standard.
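
The condition described above can be sketched as a simple check: a lifted-over record is only usable if its REF allele equals the target reference sequence at the new position, which is presumably what the "MismatchedRefAllele" rejection reflects. This is a minimal illustration, not Picard's actual implementation; the function name, the toy sequence and the coordinates are all made up.

```python
def ref_matches(target_genome, chrom, pos, ref):
    """True if REF equals the target reference at 1-based position pos."""
    seq = target_genome[chrom]
    start = pos - 1  # VCF positions are 1-based
    return seq[start:start + len(ref)].upper() == ref.upper()


# Toy stand-in for a chimp reference; real use would read a FASTA file.
toy_chimp = {"chr1": "ACGTACGTAC"}

# Hypothetical lifted-over records: (chrom, pos, ref, alt).
lifted = [("chr1", 3, "G", "A"), ("chr1", 5, "G", "A")]

usable = [v for v in lifted if ref_matches(toy_chimp, v[0], v[1], v[2])]
print(usable)
```

Only the first toy record survives, because the second one carries a REF base that differs from the target sequence at that position.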

    Geraldine Van der Auwera, PhD

  • lstbl (CU Boulder; Member; Posts: 7)

    Ah, this makes sense. I suppose I was under the impression that the VQSR model could be built with fewer SNPs than are actually required.

    However, if I look at all positions that pass the quality filter AND have more than one allele at that site (i.e. AF<1.000), there are still >28 million positions. Even if less than half of these get properly lifted over, I feel like it'd be lots of information for VQSR. (but maybe not?)

    Is there any literature out there that gives an idea of how many SNPs are necessary to adequately build the VQSR model? Sorry for my ignorance on the matter, but I wasn't able to find anything during the first iteration of literature searching.

    I much appreciate your insight!

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,117)
    Accepted Answer

    The tricky thing is that there is no hard number for how many sites you need; it depends on how noisy your data is (the less noisy, the fewer sites you need) and on how much overlap there is between your sample callset and the resources. The overlap matters because VQSR won't use all the SNPs in your resource; it can only use those that are also present in your callset. The more samples you have, the higher that number will be, of course. You can get it to work on fewer sites by requesting fewer annotations (i.e. fewer dimensions in the model) and fewer max clusters. But you need to be very confident in your input resources.
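
To get a feel for the overlap mentioned above, one could count the sites shared between the callset and the candidate resource before attempting VQSR. The sketch below is only an approximation: it assumes tab-separated VCF lines and matches sites on chromosome, position, REF and ALT, which may not be exactly how VQSR pairs them. The function name and the toy records are hypothetical.

```python
def site_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


# Toy records standing in for the two files.
callset = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr2\t50\t.\tG\tA",
]
resource = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t200\t.\tC\tT",
    "chr2\t50\t.\tG\tC",
    "chr3\t10\t.\tT\tA",
]

resource_sites = site_keys(resource)
shared = site_keys(callset) & resource_sites
print(len(shared), "of", len(resource_sites), "resource sites are in the callset")
```

In this toy case only one resource site is shared, which illustrates how a large resource can still contribute very few training sites.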

    You can definitely try it, don't let me discourage you :) But it's important to have realistic expectations, especially if it's going to take a lot of effort to whip that resource dataset into a useable shape.

    Geraldine Van der Auwera, PhD
