PhaseByTransmission and ReadBackedPhasing not emitting (at least some) indels?

KlausNZ Member Posts: 58

Hi again,
I was surprised to notice that my phased VCFs produced by both phasing tools (alone or in succession) contained about 1% (PBT) or 2% (RBP) fewer variants than the input files; this was reproducible (PBT and RBP), and occurred with and without the -mvf option (PBT).
A quick scan (SelectVariants --discordance; thanks for providing that one ;-) indicates that the missing variants are all indels (mostly insertions, from 2-20 nt); note that I haven't tested whether the phased output file lacks all indels present in the input file.

Is this the expected behaviour of both tools or am I doing something terribly wrong?
If this is expected, is there an option to emit these variants together with the phased and unphased variants in the file specified with -o? (I know I could use SelectVariants --discordance to add these back in a subsequent step, but there may be a more elegant solution.)

Sorry for the trouble, can't even remember why I counted before and after (but glad I did... )
[GATK 2.6-5] -T PhaseByTransmission -R ../human_g1k_v37_decoy.fasta -V IN.vcf -ped Trio1.ped -o OUT.vcf -mvf MV.vcf -pedValidationType SILENT
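
The before/after count and the discordance check described in this post can be approximated outside GATK. The sketch below is a hypothetical Python helper, not a GATK tool and not a description of what SelectVariants --discordance does internally. It keys each record on CHROM, POS, REF and ALT, lists input records that are absent from the output, and labels each one by type. The function names and the tiny in-memory VCFs are made up for illustration.

```python
import io
from collections import Counter


def variant_keys(handle):
    """Yield (CHROM, POS, REF, ALT) for every non-header VCF record."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield (fields[0], int(fields[1]), fields[3], fields[4])


def missing_records(input_handle, output_handle):
    """Return keys present in the input VCF but absent from the output VCF."""
    emitted = set(variant_keys(output_handle))
    return [key for key in variant_keys(input_handle) if key not in emitted]


def classify(ref, alt):
    """Rough variant type based only on the REF and ALT columns."""
    alts = alt.split(",")
    if len(alts) > 1:
        return "multi-allelic"
    if alts[0].startswith("<"):
        return "symbolic"
    if len(ref) == len(alts[0]):
        return "SNP" if len(ref) == 1 else "MNP"
    return "indel"


# Made-up records standing in for IN.vcf and OUT.vcf.
HEADER = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
IN_VCF = HEADER + (
    "1\t100\t.\tA\tG\t50\tPASS\tAC=1\n"
    "1\t200\t.\tA\tAT,ATT\t50\tPASS\tAC=1,1\n"
    "1\t300\t.\tCT\tC\t50\tPASS\tAC=2\n"
)
OUT_VCF = HEADER + (
    "1\t100\t.\tA\tG\t50\tPASS\tAC=1\n"
    "1\t300\t.\tCT\tC\t50\tPASS\tAC=2\n"
)

missing = missing_records(io.StringIO(IN_VCF), io.StringIO(OUT_VCF))
summary = Counter(classify(ref, alt) for _, _, ref, alt in missing)
print(len(missing), dict(summary))
```

Classifying on the REF and ALT columns rather than eyeballing a single column makes it harder to mistake a multi-allelic site for a plain indel. For real files, replace the io.StringIO objects with open file handles.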

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie Posts: 11,414 admin

    Hi Klaus,

    This is not the expected behaviour and is actually rather worrying to us, so I'm glad you brought it up. Could you possibly upload some test files that reproduce the issue to our FTP server (instructions in the FAQs) so we can debug this locally?

    Geraldine Van der Auwera, PhD

  • KlausNZ Member Posts: 58

    Hi Geraldine,
    Unfortunately, I cannot share the VCF that I reported in my original post because our current ethics permit does not allow making the patient variants publicly accessible (but I can share if you have a 'private' channel). However, I think I found a work-around with improved outcome for discussion in the forum; it's a bit complicated to explain so please forgive my wordy post. Here goes: The 'offending' VCF was created by mapping reads from seven patient exomes (see below for details), then combining these alignments with 23 1000 Genomes FIN and GBR exome alignments (Broad's finest!) for variant calling by HaplotypeCaller, followed by selecting my patients' genotypes from the 30-exomes variant file (SelectVariants, -se, -env) to produce the patients-only input VCF file for phasing as per my original post.

    For PhaseByTransmission:
    Because I can't share the patient variants, I selected variants for three 1000G individuals (SelectVariants, -sn HG00131, -sn HG00133, -sn HG00145, -env) from the 30-exomes variant file described above to produce a new VCF file that I can share. I 'made up' a .ped file pretending that the three individuals comprise a trio (although in real life they are likely unrelated, so plenty of Mendelian violations).

    (Un)fortunately, the story remains the same: Running PhaseByTransmission over IN.vcf results in an output file OUT.vcf with 20,411 fewer variants than in the input file (all counting done with grep); running SelectVariants --discordance over IN.vcf and OUT.vcf produces a file with the 20,411 missing variants.

    A bit of squinting at the missing variants revealed that they appeared to be multi-allelic; indeed, grep'ing "AC=[0-9]+,[0-9]+" showed 20,411 matching records in IN.vcf, zero in OUT.vcf, and 20,411 in NotInOut.vcf (the --discordance output). The same pattern was true for my patient-specific files (counts differed, of course). For clarity, my previous assessment of missing indels was wrong because I looked at the wrong column - sorry 'bout that, can we edit the message subject?
    I repeated the entire exercise with the only other GATK version I still have installed (2.5-2), with the same outcome. It appears that, in my hands, PhaseByTransmission did not emit the multi-allelic variants.
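
The grep for multi-allelic records can also be written so that it parses the columns instead of matching raw text, which avoids accidental hits on other INFO keys that happen to end in "AC=" (for example MLEAC). This is only a sketch under assumptions: the function names are hypothetical, the demo records are made up, and each record is assumed to carry the eight mandatory VCF columns.

```python
def is_multiallelic(vcf_line):
    """True if a VCF record lists more than one ALT allele or AC value."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt, info = fields[4], fields[7]
    if "," in alt:
        return True
    # Fall back to the INFO column: look for an exact AC key with several values.
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AC" and "," in value:
            return True
    return False


def count_multiallelic(lines):
    """Count multi-allelic records, skipping header and blank lines."""
    return sum(
        1
        for line in lines
        if line.strip() and not line.startswith("#") and is_multiallelic(line)
    )


# Made-up records; only the middle one is multi-allelic.
records = [
    "1\t100\t.\tA\tG\t50\tPASS\tAC=1;AN=6",
    "1\t200\t.\tA\tAT,ATT\t50\tPASS\tAC=1,1;AN=6",
    "1\t300\t.\tC\tT\t50\tPASS\tMLEAC=1;AC=2;AN=6",
]
print(count_multiallelic(records))
```

Running the same count over the input, output and discordance files should reproduce the pattern reported above if the dropped records really are the multi-allelic ones.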

    For ReadBackedPhasing:
    Phasing the variants for the three 1000G individuals is still under way; however, looking at my patient variants, only 20 (of 24,509) missing variants match the multi-allele 'AC' expression, while the majority (19,497) are 'rs' SNPs. So there seems to be a different cause (or I've messed up both phasing runs or the common input file). Please let me know if you want me to upload the file phased with ReadBackedPhasing once it's done.

    If I was a betting man, I'd put my money on me having messed up a preceding step rather than on a bug in two programs (even if they share code); also, it doesn't cause a significant problem for me, now that I've noticed it and can add the missing variants back to my call set. May I suggest you test whether the same occurs with one of your variant files?
    I have uploaded all relevant files and program message logs in the PBT folder (it appears that your server has trouble acknowledging successful completion of hour-long transfers; however, file sizes look OK, and I've uploaded md5s for verification).

    Keen to hear your thoughts, many thanks for considering oddities like this!

    Details on the seven patient exomes: two families. In one (mum, dad, two kids), the variants were phased with PBT as two trios (missing variants identical for both trios); in the other (mum and two kids), they were phased with RBP. All were captured with the 'old' Illumina 64M kit, 80-120M 100PE reads per individual, mapped according to the 'Best Practice' documents (but duplicate reads removed, not marked). BTW, VQSR wasn't particularly successful but that's another story and I currently blame the heterogeneous data set.

  • KlausNZ Member Posts: 58

    Wow, good to know! I'll have a look at the RBP results then.....

  • KlausNZ Member Posts: 58

    Dear Geraldine,
    Just tested with 2.7.1: All perfect for PhaseByTransmission. But the problem still exists with ReadBackedPhasing. Will try to take a closer look soon.
    Many thanks for quickly addressing this!

  • ebanks (Broad Institute) Member, Broadie, Dev Posts: 692 admin

    Hi Klaus,

    Is it possible that the missing records for ReadBackedPhasing are all filtered records?

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT
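
One way to check the filtered-records hypothesis is to tally the FILTER column of the records that went missing. The sketch below is a hypothetical helper, not part of GATK. It assumes the missing records have already been collected (for example, the discordance file mentioned earlier in the thread) and treats "PASS" and "." as unfiltered. The demo lines and the "LowQual" filter name are made up.

```python
from collections import Counter


def filter_tally(lines):
    """Count filtered vs. unfiltered records based on the FILTER column."""
    tally = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        status = line.rstrip("\n").split("\t")[6]
        # "PASS" means all filters passed; "." means no filters were applied.
        label = "unfiltered" if status in ("PASS", ".") else "filtered"
        tally[label] += 1
    return tally


# Made-up records standing in for the missing (discordant) variants.
missing_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tAC=1",
    "1\t200\trs2\tC\tT\t12\tLowQual\tAC=1",
    "1\t300\trs3\tG\tA\t60\t.\tAC=2",
]
tally = filter_tally(missing_lines)
print(dict(tally))
```

If nearly all missing records land in the "filtered" bucket, that would support the explanation suggested above; a large "unfiltered" count would point to a different cause.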

  • mrad09 (Milwaukee) Member Posts: 1

    I'd like to know whether this bug was fixed for ReadBackedPhasing. I'm seeing the same situation, with consecutive homozygous SNPs being left out of the output.vcf file.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie Posts: 11,414 admin

    Make sure you're running the latest version. If you still have a problem, post a new question with full details (command line, vcf records that show the problem, etc).

    Geraldine Van der Auwera, PhD
