ReadBackedPhasing didn't merge consecutive phased sites into MNP records ?

frankfengfrankfeng Member
edited October 2015 in Ask the GATK team

Dear GATK team,

I have been using GATK for years on variant calling (thank you first!), but this is the first time I tried ReadBackedPhasing to emit MNPs, and it didn't output any MNPs. Could you please help me to find out the reason?

I was calling variants on whole exome data, and, following the GATK's best practices, basically the procedure was:

HaplotypeCaller with --emitRefConfidence GVCF --> GenotypeGVCFs --> VQSR --> ReadBackedPhasing with -enableMergeToMNP

Every step went successfully without errors.

In the output VCF file of ReadBackedPhasing, tag HP, which was not in the output of previous steps, was added to the genotype section, but no consecutive phased sites were merged into MNP records. I have checked those variants, and many of them should have been merged into MNP records. For example, the alignment of one of them is shown in the attached figure.

The version of GATK that I tried are nightly-2015-08-16-g3087cd7 and nightly-2015-10-29-gc252559. The complete command I used for ReadBackedPhasing is shown below, and I also used options --maxGenomicDistanceForMNP and --phaseQualityThresh but still no MNPs were emitted.

java -Xmx9g -jar GenomeAnalysisTK.jar \
    -T ReadBackedPhasing \
    -R human_g1k_v37_decoy.fasta \
    -enableMergeToMNP \
    --variant VQSR_out.vcf.gz \
    -L VQSR_out.vcf.gz \
    -o out.vcf \
    -I 2.bam \
    -I 3.bam \
    -I 4.bam \
    -I 5.bam \
    -et NO_ET -K key_file

Thanks,

Frank

image

Post edited by frankfeng on

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi Frank,

    What value did you use for --maxGenomicDistanceForMNP? Can you try making the value higher?

    -Sheila

  • Thanks @Sheila

    I have tried 25, 5 and the default value for --maxGenomicDistanceForMNP, but they didn't change the results. I also tried 5, 1 and the default value for --phaseQualityThresh together with big value of --maxGenomicDistanceForMNP, but still there were no MNPs emitted.

    I used ReadBackedPhasing on almost a million of human variants and no MNP was emitted. As I am using the nightly build versions (nightly-2015-08-16-g3087cd7 and nightly-2015-10-29-gc252559), could you please ask the author of ReadBackedPhasing if there are any changes affecting the feature of emitting MNPs? Or there might be a potential issue/bug in the recent commits?

    Thanks.

    Frank

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi Frank,

    Can you please submit a bug report for us to investigate locally? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • Hi @Sheila

    I have already uploaded the test file (named test_readBackedPhasing.tar.gz) onto your FTP server.

    Thanks!

    Regards,

    Frank

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi Frank,

    I just submitted a bug report. I will keep you updated about the status.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @frankfeng Thanks for your bug report; the problem bug has been fixed in our development version. You can use the fixed version by downloading the next nightly build (available in the Downloads section) starting tomorrow, or wait until the next official release.

  • frankfengfrankfeng Member

    @Geraldine_VdAuwera Thank you so much! I will try the nightly build!

  • frankfengfrankfeng Member

    @Geraldine_VdAuwera I have tried the nightly-2016-01-07-gf3f11fe version, and it worked. Thank you very much for your work!

    Are you planning to officially release a bug-fixed version for V3.5? If so, approximatively when will it be? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @frankfeng, glad to hear it worked. Because RBP is not a core tool, we don't consider that it warrants a patch release, so it's unlikely that we'll make an official 3.5-x release unless we find a bug that is considered more important in one of the core tools. Failing that, for an official release you'd have to wait for version 3.6. Now, normally that would be at least 6 to 8 weeks down the road, but we do need to bump GATK to Java 8 in the near future to stay in sync with htsjdk and Picard, so we might cut an early release just for that. Hard to say exactly when that might happen though.

  • frankfengfrankfeng Member
    edited January 2016

    Hi @Geraldine_VdAuwera RBP now can merge consecutive phased sites into MNP records, but there is another potential issue.

    RBP (nightly-2016-01-07-gf3f11fe) crashed with ERROR MESSAGE: Unexpected base in allele bases '**', with input VCF, in which there were such three variants (here I only show you the first 7 fields):

    3       62518392        .       TTGTGTG T,TTG   46263.3 PASS
    3       62518397        .       T       *       24145.9 PASS
    3       62518398        .       G       *       23561.7 PASS
    

    The input VCF was generated with GATK nightly-2015-08-16-g3087cd7, with HaplotypeCaller and GenotypeGVCFs with human reference genome (b37), and following SelectVariants --excludeNonVariants --removeUnusedAlternates to extract variants in a subset of samples.

    It looks like RBP crashed when trying to merge two consecutive * ALT alleles. Also, it seems SelectVariants --excludeNonVariants failed to remove the last two variant records, which were not actually variants -- In this case, it would have kept only the first variant record.

    Thanks.

    Post edited by frankfeng on
  • frankfengfrankfeng Member
    edited January 2016

    Sorry, writing correction for my previous post:

    The input VCF was generated with GATK nightly-2015-08-16-g3087cd7, with HaplotypeCaller and GenotypeGVCFs with human reference genome (b37), followed by SelectVariants --excludeNonVariants --removeUnusedAlternates to extract variants in a subset of samples.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi Frank,

    So, you are saying that SelectVariants should remove the last record with the * alternate allele but it does not? Are those sites in the original files you uploaded? If not, we will need you to submit another bug report.

    Thanks,
    Sheila

  • frankfengfrankfeng Member

    Hi @Sheila

    Yes, SelectVariants with --excludeNonVariants should remove the last two records with the * alternate allele but it does not.

    Can you please submit two bug reports for the potential issue? I have already uploaded test files onto your FTP server (file "test_again_readBackedPhasing.tar" for readBackedPhasing and "test_SelectVariants.tar" for SelectVariants) .

    Thank you so much!

    Frank

    Issue · Github
    by Sheila

    Issue Number
    464
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi Frank,

    Thanks. I am about to have a look at these, and I hope to submit bug reports by tomorrow.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @frankfeng
    Hi,

    Sorry for the delay. I have just submitted a bug report. You can track the issue here.

    -Sheila

  • Yes, SelectVariants with --excludeNonVariants should remove the last two records with the * alternate allele but it does not.

    Thanks @Sheila Could you also submit a bug report for SelectVariants as described above? I didn't see that in the issue track web-page... Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @frankfeng For the SelectVariants part of your issue we're still deciding exactly what to do (drop those sites by default or provide an option to control the behavior).

  • sls5775sls5775 UT, AustinMember
    edited April 2016

    Hi all -- I think I'm having the same (original) problem with nightly build vnightly-2016-04-05-gf73e637.

    Apologies if I missed something, but running ReadBackedPhasing on a vcf file created by GenotypeGVCFs as follows does not seem to create a MNP marker out of the SNP calls as expected:

    java -Xmx25G -jar ~/GenomeAnalysisTK.jar -T ReadBackedPhasing -enableMergeToMNP \
    -R example_MNP_pvirgatum/Panicum_virgatum.mainGenome.scaffolds.fasta \
    --variant example_MNP_pvirgatum/example.vcf --sampleToPhase A_237_RE1_R12 -L Chr01a \
    -I example_MNP_pvirgatum/example.sorted.bam -o example_MNP_pvirgatum/examplePHASED.vcf

    produces

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A_237_RE1_R12
    Chr01a 97357440 . G A 7249.68 . . GT:GQ:HP 0/1:99:97357440-1,97357440-2
    Chr01a 97357450 . T C 7247.66 . . GT:GQ:HP:PQ 0/1:99:97357440-1,97357440-2:547.25
    Chr01a 97357459 . T C 7248.12 . . GT:GQ:HP:PQ 0/1:99:97357440-1,97357440-2:548.59
    Chr01a 97357491 . T C 9201.92 . . GT:GQ:HP:PQ 0/1:99:97357440-1,97357440-2:557.25

    but the reads suggest that a MNP is appropriate in this example.

    image

    Thoughts?

    In terms of the broader picture, I have ~400 samples which I would like to genotype using MNPs (as the samples are F2's from a four-way cross and SNPs cannot distinguish between parents). I was hoping the ReadBackedPhasing tool would not only identify MNPs within a sample, but also be aware of MNPs across samples and extend and call MNPs relative to the population as opposed to only within the sample. (E.g., I'm imagining there might be a case where a reference allele SNP is useful as a component of a MNP marker across samples but there is no alternative variant within a sample which would suggest this SNP should necessarily be part of the MNP call for the sample just based on the sample itself without accounting for the rest of the population).

    Thanks -- I appreciate your kind help,
    Scott

  • sls5775sls5775 UT, AustinMember

    Apologies -- I did indeed miss the obvious.

    Including the "-maxDistMNP 1000" flag produces

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A_237_RE1_R12
    Chr01a 97357440 . GAAGTTTCTATATCTTGCTTAAGCCCTTTCCAAAAGAAATGGTTCTTGACTT AAAGTTTCTACATCTTGCTCAAGCCCTTTCCAAAAGAAATGGTTCTTGACTC 9201.92 PASS AC=1;AF=0.500;AN=2 GT:GQ:HP:PQ 0/1:99:97357440-1,97357440-2:547.25

    Very nice! Not to be petulant or attempt to excuse my oversight and blunder, but perhaps a larger default value would be preferable?
    In my case we have PE reads that should come from fragments of <500bp (but perhaps there are some larger fragments here and there) and I'd like the physical phasing to be done across the paired reads to produce as informative MNPs as possible.
    Of course one might simply understand the flags and their uses a little better as well...

    Thanks for all your hard work and outstanding contributions,
    Scott

    Issue · Github
    by Sheila

    Issue Number
    796
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775
    Hi Scott,

    I am happy you fixed the issue yourself.
    You are right about making the default value higher. Let me make a note and see what we can do. But, you're also right we expect users to "simply understand the flags and their uses a little better as well..." :)

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775
    Hi again Scott,

    The original intent of merging MNPs was to merge SNPs that are immediately side by side and therefore very likely to have originated from the same biological event. Extending the range increases the chance of merging things that don't belong together.

    We have decided to leave the default value as it is, but make the default behavior clearer in the documentation.

    -Sheila

  • sls5775sls5775 UT, AustinMember
    edited April 2016

    Fabulous -- thanks for following up on that, Sheila.

    Speaking of which...
    As I've continued to follow in Frankfeng's footsteps I've arrived at a similar problem:
    "ERROR MESSAGE: Unexpected base in allele bases '*GTGC*'".

    I see the "ReadBackedPhasing should ignore * alleles" ticket on https://www.broadinstitute.org/gatk/guide/issue-tracker is active and appears to clearly define the scope of the issue. Are you guys close to a fix on that ticket? If not, do you have any suggestions as to how I might proceed?

    Interestingly, when I run a cohort instead of a single sample I don't get an error message. But I also don't appear to get any MNP creation, either. Thoughts on that?

    Thanks :)
    Scott

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775
    Hi Scott,

    I think a fix will be in the nightly builds very soon! I will post here when it is in.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775 @frankfeng
    Hi!

    The fix just went in today. You should be able to download the nightly build tomorrow and the fix will be in it :smile:

    -Sheila

  • sls5775sls5775 UT, AustinMember

    Hi all (and Sheila!) -- thanks so much for the fix!

    I've used 3.6 with java 8 to very happy effect; however, I'm wanting to make MNPs across multiple samples and I'm seeing some (probably very expected) "unexpected" results:

    When I run ReadBackedPhasing on a single sample, I get:

    Sample: A_237_RE1_R12 Sites tested: 570 Sites phased: 404 Phase-inconsistent sites: 74 [phased: 69, unphased:5]

    and things are looking great! =)

    And when I run ReadBackedPhasing with a second sample added but using "--sampleToPhase A_237_RE1_R12", I get:

    Sample: A_237_RE1_R12 Sites tested: 850 Sites phased: 598 Phase-inconsistent sites: 142 [phased: 134, unphased:8]

    and things look even better than great! =D

    But when I run ReadBackedPhasing with a second sample added, and DO NOT use "--sampleToPhase A_237_RE1_R12", I get:

    Sample: A_237_RE1_R12 Sites tested: 486 Sites phased: 346 Phase-inconsistent sites: 71 [phased: 62, unphased:9]
    Sample: A_239_RE1_R12 Sites tested: 625 Sites phased: 451 Phase-inconsistent sites: 87 [phased: 80, unphased:7]

    which tests and finds fewer sites for the purposes of MNP creation in the initial sample.

    In trying to piece together/understand what's happening I'm thinking
    (1) adding samples increases evidence of a MNP at a loci (and so I suppose it could also decrease evidence of a MNP even if it's clearly present in one of the samples??), and
    (2) adding samples could result in multi-allelic (as opposed to biallelic) SNPs causing such loci to be ignored in MNP creation.

    Actually I guess I'm not totally clear on why just adding (but not calling) samples helps but adding and also calling samples "hurts" things.

    But anyway, multi-allelic MNPs are actually exactly what I'm interested in.
    I'm therefore thinking of supplying all the samples I have but using "--sampleToPhase" to generate MNPs on a sample by sample basis.
    Then I would merge individual sample vcfs back together (extending MNP runs with reference bases where appropriate) in order to produce multiallelic MNPs across samples.

    Does this seem like a reasonable approach?
    Or am I misunderstanding something here?

    Many, many thanks!!
    Scott

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775
    Hi Scott,

    Can you please post the exact commands you ran for all three examples?

    Thanks,
    Sheila

  • sls5775sls5775 UT, AustinMember
    edited June 2016

    Hi Sheila!

    So I previously used an older version of GATK to produce vcf files of ~485 samples.
    I then created one and two sample vcf files as follows:

    grep ^## cohort.1_410.vcf > cohort.1b.vcf
    grep ^#CHROM cohort.1_410.vcf | cut -f1-9,46 >> cohort.1b.vcf
    grep -v ^# cohort.1_410.vcf | cut -f1-9,46 >> cohort.1b.vcf
    grep ^#CH cohort.1b.vcf
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A_237_RE1_R12

    grep ^## cohort.1_410.vcf > cohort.1-2b.vcf
    grep ^#CHROM cohort.1_410.vcf | cut -f1-9,46,48 >> cohort.1-2b.vcf
    grep -v ^# cohort.1_410.vcf | cut -f1-9,46,48 >> cohort.1-2b.vcf
    grep ^#CHROM cohort.1-2b.vcf
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A_237_RE1_R12 A_239_RE1_R12

    The reference variables are as follows:
    refDir=/work/03055/sls5775/project.Genomes_universalJob/Panicum_virgatum_V2_release/Panicum_virgatum/sequences
    ref=Panicum_virgatum.mainGenome.scaffolds.fasta

    So, what I see is, for just a single sample, including "--sampleToPhase" changes the phasing/merging output:

    java -Xmx2G -jar ~/GenomeAnalysisTK.jar -T ReadBackedPhasing -enableMergeToMNP -maxDistMNP 1000 --sampleToPhase A_237_RE1_R12 -R $refDir/$ref --variant cohort.1b.vcf -L Chr01a -I A_237_RE1_R12.sorted.RALN.bam -o cohortPHASED.1.Chr01a.vcf

    Sample: A_237_RE1_R12 Sites tested: 850 Sites phased: 599 Phase-inconsistent sites: 142 [phased: 134, unphased:8]

    java -Xmx2G -jar ~/GenomeAnalysisTK.jar -T ReadBackedPhasing -enableMergeToMNP -maxDistMNP 1000 -R $refDir/$ref --variant cohort.1b.vcf -L Chr01a -I A_237_RE1_R12.sorted.RALN.bam -o cohortPHASED.1.Chr01a.vcf

    Sample: A_237_RE1_R12 Sites tested: 478 Sites phased: 334 Phase-inconsistent sites: 67 [phased: 58, unphased:9]

    Contrary to the first time I did this, adding a second sample does not seem to greatly affect things (although note that the results are slightly different upon adding a second sample). [I apologize, I am unsure what I did differently the first time around to produce my previous result.]

    java -Xmx2G -jar ~/GenomeAnalysisTK.jar -T ReadBackedPhasing -enableMergeToMNP \
    -maxDistMNP 1000 --sampleToPhase A_237_RE1_R12 \
    -R $refDir/$ref --variant cohort.1-2b.vcf -L Chr01a -I A_237_RE1_R12.sorted.RALN.bam \
    -I A_239_RE1_R12.sorted.RALN.bam -o cohortPHASED.1w2.Chr01a.vcf

    Sample: A_237_RE1_R12 Sites tested: 850 Sites phased: 598 Phase-inconsistent sites: 142 [phased: 134, unphased:8]

    java -Xmx2G -jar ~/GenomeAnalysisTK.jar -T ReadBackedPhasing -enableMergeToMNP \
    -maxDistMNP 1000 \
    -R $refDir/$ref --variant cohort.1-2b.vcf -L Chr01a -I A_237_RE1_R12.sorted.RALN.bam \
    -I A_239_RE1_R12.sorted.RALN.bam -o cohortPHASED.1w2.Chr01a.vcf

    Sample: A_237_RE1_R12 Sites tested: 486 Sites phased: 346 Phase-inconsistent sites: 71 [phased: 62, unphased:9]
    Sample: A_239_RE1_R12 Sites tested: 625 Sites phased: 451 Phase-inconsistent sites: 87 [phased: 80, unphased:7]

    So now I have the feeling that adding multiple samples works just as expected, but the difference in behavior as a result of the "--sampleToPhase" flag is still throwing me off...

    What do you think??

    Thanks!!
    Scott

    Issue · Github
    by Sheila

    Issue Number
    1018
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @sls5775
    Hi Scott,

    I tried testing this on some of my own data, and I'm not seeing the exact same thing you are seeing. But, let me talk to the team and get back to you.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Scott @sls5775,

    It's possible that the older version you were using before had some bugs that were subsequently resolved.

    Regarding the behavior of the --sampleToPhase argument, as I recall its intent is limit phasing at the sites where there is a variant call (ie a genotype other than hom-ref) in the sample named by the argument. That's why you see different summary numbers depending on the sample you look at -- some may have more or fewer variant calls. Does that clarify things?

Sign In or Register to comment.