# Using Picardtools LiftoverVCF between species

CU BoulderMember

I have a VCF file produced by mapping reads from chimpanzee onto the human genome. I want to take this VCF and do a liftover to the chimpanzee genome using picardtools LiftoverVcf. However, when I do this, everything gets dumped in the 'rejected' folder and labeled "MismatchedRefAllele".

I understand why this would be a good feature when you are mapping to the same species, but is there any way to change this behavior in the program? I tried setting the "LIFTOVER_MIN_MATCH" parameter to 0, but that doesn't seem to do anything.

One possibility is to switch the "REF" and "ALT" bases in the original VCF and ignore all positions that have an allele frequency of 100% (AF=1.000)--i.e. fixed differences between human and chimpanzee. However, I'd prefer not to mess with the file too much myself if possible.
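As a rough illustration of the REF/ALT swap described above, here is a minimal Python sketch. It assumes biallelic records with `AC`, `AF` and `AN` in the INFO column and no genotype columns that would need recoding. The function name `swap_ref_alt` is hypothetical and not part of Picard or GATK. The sketch also does not check which allele the chimpanzee reference actually carries at the lifted position, so whether a swap is appropriate at a given site remains an open question.

```python
def swap_ref_alt(line, fixed_af=1.0):
    """Swap REF and ALT on one tab-separated, biallelic VCF data line.

    Returns None for sites that should be skipped: multiallelic records,
    records without AF, and sites with AF >= fixed_af (fixed differences).
    Assumption: there are no genotype columns that need recoding.
    """
    fields = line.rstrip("\n").split("\t")
    ref, alt, info = fields[3], fields[4], fields[7]
    if "," in alt:
        return None  # multiallelic: out of scope for this sketch
    tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    if "AF" not in tags:
        return None
    af = float(tags["AF"])
    if af >= fixed_af:
        return None  # fixed difference between the two species

    fields[3], fields[4] = alt, ref

    # After the swap, AC and AF describe the allele that used to be REF.
    new_info = []
    for kv in info.split(";"):
        key, sep, value = kv.partition("=")
        if key == "AF":
            value = f"{1.0 - af:.3f}"
        elif key == "AC" and sep and "AN" in tags:
            value = str(int(tags["AN"]) - int(tags["AC"]))
        new_info.append(key + sep + value)
    fields[7] = ";".join(new_info)
    return "\t".join(fields)


example = "chr1\t74125\t.\tG\tA\t5453.96\tPASS\tAC=15;AF=0.441;AN=34"
print(swap_ref_alt(example))
```

Header lines (those starting with `#`) would need to be passed through unchanged by the caller.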

I also tried LiftoverVariants from an earlier version of GATK (2.9), but it doesn't seem to support lifting over between species.

Any thoughts would be much appreciated!


The tricky thing is that there is no hard number for how many sites you need: it depends on how noisy your data is (the less noisy, the fewer sites you need) and how much overlap you have between your sample callset and the resources. That's because VQSR won't use all the SNPs in your resource; it can only use those that are also present in your callset. The more samples you have, the higher that number will be, of course. You can get it to work on fewer sites by requesting fewer annotations (i.e. fewer dimensions in the model) and fewer max clusters. But you need to be very confident in your input resources.
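To get a feel for the overlap mentioned above, one can count how many resource sites also appear in the callset. The following is a minimal sketch, assuming both inputs are uncompressed, tab-separated VCF text on the same reference. The helper names `site_keys` and `overlap_count` are hypothetical, and matching on (CHROM, POS, REF, ALT) is a simplification, not a description of how VQSR itself matches sites.

```python
def site_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def overlap_count(callset_lines, resource_lines):
    """Number of sites present in both the callset and the resource."""
    return len(site_keys(callset_lines) & site_keys(resource_lines))


# Toy data for illustration only; these are not real variants.
callset = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.",
]
resource = [
    "chr1\t200\t.\tC\tT\t.\t.\t.",
    "chr1\t300\t.\tG\tA\t.\t.\t.",
]
print(overlap_count(callset, resource))
```

With real files, the same functions can be fed open file handles, since they only iterate over lines.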

You can definitely try it; don't let me discourage you. But it's important to have realistic expectations, especially if it's going to take a lot of effort to whip that resource dataset into a usable shape.

These tools are simply not designed for that use case. You would be better off realigning the original reads to the chimpanzee genome. I'm not very familiar with chimpanzee genomics, but what you're trying to do sounds unnecessarily complicated. Can you tell us more about what you're trying to achieve?
• CU BoulderMember

I'm using data from a study which mapped non-human short reads to a human reference. This is why the .vcf files have a human reference, not a chimpanzee reference.

I agree this is sort of a strange thing to do, and I also agree that remapping the reads myself is the way to go (which I have done). The reason I want to lift these human-mapped variants over to the other species' reference is that I thought they would make a good resource set for building the VQSR model. Of course, I have the SNPs from dbSNP; however, for some species in this study, there isn't any SNP information on NCBI. So, for my purposes, it'd be great to lift these human-mapped variants over to that species' genome and use them as input to VQSR.

Hope that makes sense.

Hi @lstbl, thanks for clarifying. I don't think this set of variants would be particularly helpful for VQSR, for the following reason. If I understand correctly how it was produced, that set of variants describes how chimpanzees are different from humans. For doing VQSR on chimp data aligned to the chimp reference, you need a resource set of variants cataloguing variation among chimps. Anything else will produce suboptimal results if it runs at all.

• CU BoulderMember
edited June 2016

Hi Geraldine,

This is true; however, I believe it could still give good SNP information. For example, take this entry from the beginning of the .vcf file (partially truncated for clarity):

CHROM POS ID REF ALT QUAL FILTER INFO
chr1 74125 . G A 5453.96 PASS AC=15;AF=0.441;AN=34

Given that you could lift this over, I believe you could still say this is a known SNP in the chimpanzee genome, no?

Sure, but that will only be the case at sites where the human and chimp references have identical sequences. I don't know enough about chimp genomes to say whether that will apply for enough of the sites in your dataset. But I expect you'll be missing out on a lot of the common variation that is specific to chimps -- so overall it's very unlikely that the dataset will be informative enough to be useable for VQSR. These tools require very large numbers of sites to work properly, and highly curated resource datasets. For humans we use genotyping chip results and several other resources that have been heavily validated. Frankly, I'm very skeptical that this dataset is going to come anywhere close to meeting that standard.

• CU BoulderMember

Ah, this makes sense. I suppose I was under the impression that the VQSR model could be built with fewer SNPs than are actually required.

However, if I look at all positions that pass the quality filter AND have more than one allele at that site (i.e. AF<1.000), there are still >28 million positions. Even if fewer than half of these get properly lifted over, I feel like that would be a lot of information for VQSR (but maybe not?).
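For reference, a count like the one above could be produced along these lines. This is only a sketch under assumptions: the input is uncompressed, tab-separated VCF text, `AF` is present in the INFO column, and a site is counted when FILTER is `PASS` and every listed `AF` value is below 1. The function name `count_segregating_pass_sites` is hypothetical.

```python
def count_segregating_pass_sites(vcf_lines):
    """Count PASS records whose AF values are all below 1.0."""
    count = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[6] != "PASS":
            continue
        for kv in fields[7].split(";"):
            if kv.startswith("AF="):
                # AF may hold one value per ALT allele, comma-separated.
                afs = [float(x) for x in kv[3:].split(",")]
                if max(afs) < 1.0:
                    count += 1
                break  # only one AF tag is expected per record
    return count


# Toy records for illustration; only the first is taken from the thread.
records = [
    "chr1\t74125\t.\tG\tA\t5453.96\tPASS\tAC=15;AF=0.441;AN=34",
    "chr1\t80000\t.\tC\tT\t900.00\tPASS\tAC=34;AF=1.000;AN=34",
    "chr1\t90000\t.\tT\tC\t12.00\tLowQual\tAC=2;AF=0.059;AN=34",
]
print(count_segregating_pass_sites(records))
```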

Is there any literature out there that gives an idea of how many SNPs are necessary to adequately build the VQSR model? Sorry for my ignorance on the matter, but I wasn't able to find anything during the first iteration of literature searching.