Duplicate allele error during LiftoverVcf run

Hey folks

I'm trying to run VQSR on some data that was aligned to b37 a couple of years ago. The reference files have moved and the Best Practices have changed since they were first posted for b37, so I thought it might be easiest to run LiftoverVcf to move the calls to hg38 and run VQSR with the most recent hg38 files. I downloaded the chain file for b37 to hg19, and that liftover ran fine; then I tried to lift over from hg19 to hg38, and that's giving me errors.

I think I'm using the latest version of Picard, 2.18.29, Java version 1.8.0_201 (and gatk 4.1.0.0 if that's relevant).

This is WES + capture kit data.

The command I used was:
java -jar ~/tools/picard.jar LiftoverVcf I=MyData.vcf O=MyData_lifted_over.vcf CHAIN=hg19ToHg38.over.chain REJECT=rejected_variants.vcf R=hg38.fa

The error is:
Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: C
at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:396)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

I then ran it again with a slightly different reference file and got something similar:

Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: A
at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:396)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

If I narrow down the top portion of the file to see where the error appears, I think it's this line:

chr1 144931739 rs375230102 AGGC *,AGGCGGC,A 2312.56 PASS AC=2,9,9;AF=4.950e-03,0.022,0.022;AN=404;BaseQRankSum=-2.890e-01;ClippingRankSum=0.289;DB;DP=5719;FS=0.751;InbreedingCoeff=-1.3149;MLEAC=2,9,7;MLEAF=4.950e-03,0.022,0.017;MQ=60.00;MQ0=0;MQRankSum=0.300;QD=5.02;ReadPosRankSum=0.286;SOR=0.800

I'm not sure what's wrong here. I'll also give you a few lines before and after in case I screwed that up somehow:

chr1 144931607 rs6673292 C T 492.12 PASS AC=3;AF=7.426e-03;AN=404;BaseQRankSum=-1.204e+00;ClippingRankSum=0.00;DB;DP=6458;FS=1.721;InbreedingCoeff=-0.0075;MLEAC=3;MLEAF=7.426e-03;MQ=60.00;MQ0=0;MQRankSum=0.012;QD=4.21;ReadPosRankSum=0.450;SOR=0.906
chr1 144931699 rs144526186 T A 192.12 PASS AC=2;AF=4.950e-03;AN=404;BaseQRankSum=2.13;ClippingRankSum=0.477;DB;DP=6505;FS=2.847;InbreedingCoeff=-0.0050;MLEAC=2;MLEAF=4.950e-03;MQ=60.00;MQ0=0;MQRankSum=-1.090e-01;QD=3.00;ReadPosRankSum=0.065;SOR=1.259
chr1 144931727 rs2985363 G A 28702.22 PASS AC=126;AF=0.312;AN=404;BaseQRankSum=0.299;ClippingRankSum=0.301;DB;DP=6107;FS=0.594;InbreedingCoeff=-0.4553;MLEAC=126;MLEAF=0.312;MQ=60.00;MQ0=0;MQRankSum=0.025;QD=7.51;ReadPosRankSum=0.586;SOR=0.790
chr1 144931737 rs765186109 CGAG C,GGAG 431.87 PASS AC=2,1;AF=4.950e-03,2.475e-03;AN=404;BaseQRankSum=-6.060e-01;ClippingRankSum=-6.270e-01;DB;DP=5793;FS=3.081;InbreedingCoeff=-0.0074;MLEAC=2,1;MLEAF=4.950e-03,2.475e-03;MQ=60.00;MQ0=0;MQRankSum=0.209;QD=5.47;ReadPosRankSum=0.079;SOR=1.055
chr1 144931739 rs375230102 AGGC *,AGGCGGC,A 2312.56 PASS AC=2,9,9;AF=4.950e-03,0.022,0.022;AN=404;BaseQRankSum=-2.890e-01;ClippingRankSum=0.289;DB;DP=5719;FS=0.751;InbreedingCoeff=-1.3149;MLEAC=2,9,7;MLEAF=4.950e-03,0.022,0.017;MQ=60.00;MQ0=0;MQRankSum=0.300;QD=5.02;ReadPosRankSum=0.286;SOR=0.800
chr1 144935104 rs147352020 A G 10599.84 PASS AC=5;AF=0.012;AN=404;BaseQRankSum=0.748;ClippingRankSum=-1.760e-01;DB;DP=48219;FS=0.000;InbreedingCoeff=-0.0125;MLEAC=5;MLEAF=0.012;MQ=60.00;MQ0=0;MQRankSum=-5.640e-01;QD=4.39;ReadPosRankSum=1.48;SOR=0.709
chr1 144935209 . T A 1324.16 PASS AC=1;AF=2.475e-03;AN=404;BaseQRankSum=2.60;ClippingRankSum=0.378;DP=47081;FS=15.309;InbreedingCoeff=-0.0025;MLEAC=1;MLEAF=2.475e-03;MQ=60.00;MQ0=0;MQRankSum=0.633;QD=4.09;ReadPosRankSum=-7.390e-01;SOR=1.570
chr1 144935261 . A G 1243.16 PASS AC=1;AF=2.475e-03;AN=404;BaseQRankSum=0.902;ClippingRankSum=-1.088e+00;DP=46942;FS=2.215;InbreedingCoeff=-0.0025;MLEAC=1;MLEAF=2.475e-03;MQ=60.00;MQ0=0;MQRankSum=1.60;QD=5.63;ReadPosRankSum=1.42;SOR=0.917
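
A quick sanity check on the suspect record is to confirm that its own REF and ALT columns do not repeat an allele. The following is a minimal Python sketch, assuming a whitespace-separated VCF data line; the helper name `duplicate_alleles` is made up for illustration, and it only inspects the input record, not what LiftoverVcf does to it internally.

```python
from collections import Counter

def duplicate_alleles(vcf_line):
    """Return alleles that occur more than once among REF and ALT of one VCF data line."""
    fields = vcf_line.split()                 # CHROM POS ID REF ALT QUAL FILTER INFO ...
    ref = fields[3].upper()
    alts = [a.upper() for a in fields[4].split(",")]
    counts = Counter([ref] + alts)
    return sorted(allele for allele, n in counts.items() if n > 1)

# The record suspected of triggering the exception (INFO column shortened here).
suspect = "chr1 144931739 rs375230102 AGGC *,AGGCGGC,A 2312.56 PASS AC=2,9,9"
print(duplicate_alleles(suspect))  # prints []
```

The check comes back empty for this record, which suggests the duplicate allele arises while the record is being lifted rather than in the input as written.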

If someone could point me in the right direction it would be a big help.

Thanks!

David



PS -- When I only run the first part of the file, I process 74348 variants and find that 99.6584% of variants were successfully lifted over and written to the output. I also get a few warnings of the type "Interval chr1:120583691-120583697 failed to match chain 2 because intersection length 4 < minMatchSize 7.0 (0.5714286 < 1.0)" -- I see several references to this type of warning but couldn't find the meaning, could someone point me to that?
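
The numbers in that warning can at least be reproduced by hand. The sketch below assumes the message means what its wording suggests: the interval length multiplied by a minimum match fraction (1.0 here) gives `minMatchSize`, and a chain is skipped when fewer bases than that overlap it. This is an interpretation of the log line, and the function name `match_stats` is hypothetical.

```python
def match_stats(start, end, intersection_length, min_match_fraction=1.0):
    """Recompute the quantities printed in the warning for a 1-based inclusive interval."""
    interval_length = end - start + 1
    min_match_size = min_match_fraction * interval_length
    matched_fraction = intersection_length / interval_length
    return interval_length, min_match_size, matched_fraction

# Values taken from the warning: chr1:120583691-120583697, intersection length 4.
length, min_size, fraction = match_stats(120583691, 120583697, intersection_length=4)
print(length, min_size, round(fraction, 7))  # prints 7 7.0 0.5714286
```

Under that reading, only 4 of the interval's 7 bases map through chain 2, which is below the required fraction of 1.0, so that chain is not used for the interval.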

Answers

  • davidsiegel Member
    And to be clear, the reference file I used was the one that gave me the "VariantContext: A" error.
  • bhanuGandham (Cambridge MA) Member, Administrator, Broadie, Moderator

    Hi @davidsiegel

    You may find this thread helpful as well.

  • davidsiegel Member
    Hi @bhanuGandham

    I saw that thread but the reference allele is not one of the variant alleles in any of the lines above (as far as I can tell). The reference allele is AGGC, the variants are *,AGGCGGC,A.
  • davidsiegel Member
    It's been a couple of weeks since my initial post. Is this really a new bug?
  • bhanuGandham (Cambridge MA) Member, Administrator, Broadie, Moderator

    Hi @davidsiegel

    Sorry about the delay in getting back to you. Our team was busy with a GATK workshop, hence the delay. I am looking into this issue for you and will get back to you soon.

    Again I apologize for any inconvenience.

  • davidsiegel Member
    Thanks. In the meantime I wrote a script to repeatedly run the liftover to find and delete the problem variants if the output was an error -- it found 15 problem variants out of 1.3 million (and liftover now runs successfully on the rest). This took a few hours to run, less than a day. In the end 99.7% of variants were successfully lifted over -- roughly 0.2% had a mismatched reference allele.
  • bhanuGandham (Cambridge MA) Member, Administrator, Broadie, Moderator

    Hi @davidsiegel

    Can you please share your input VCF and the 15 problem variants? I am curious to see why this might have happened. Here is a doc with details on how to share data with the GATK team: https://software.broadinstitute.org/gatk/guide/article?id=1894

  • davidsiegel Member
    I will when I have a chance. Some maintenance is being done on our system, so I can't mess with the files right now.
  • davidsiegel Member
    It's uploaded as das_upload.tar.gz

    I included two example files, one that runs without error ("vcf_with_23_good_variants.vcf") and one that also has 13 bad variants at the end ("vcf_with_23_good_variants_and_13_bad.vcf"). I wasn't sure what you needed so I just uploaded everything required to run the command.
  • davidsiegel Member
    @bhanuGandham (see above comment, not sure if the "@" symbol is necessary)
  • davidsiegel Member
    Hi @bhanuGandham

    Another question I have is regarding the "MismatchedRefAllele" error -- what exactly causes this? For example, I have this allele in hg19:

    chr1 1577003 rs3819995 C T 35937.41 MismatchedRefAllele

    so the ref is C and the alt is T; whereas in hg38 the ref is T and the alt is C for the same rsID. Is there any way to know what caused this problem? Did the rsID change, or is it just the reference allele? This only happened for a couple thousand alleles (0.2% of the total), but it would be nice to know if this is something we can improve upon.

    Thanks,

    David
  • davidsiegel Member
    (I'm led to believe that the reference allele doesn't often change, but it would be nice to know if there's a database or something of those occurrences -- I used the UCSC liftover file so I'm not sure why I'm not covered here)
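
For the MismatchedRefAllele question, one way to sort the rejected records is by how the old alleles relate to the base found in the new reference. The following is a minimal sketch, assuming bi-allelic SNPs on the same strand and that the hg38 base at the lifted position has already been looked up by some other means; the function name and category labels are invented for illustration.

```python
def classify_ref_mismatch(old_ref, old_alt, new_ref_base):
    """Describe how a bi-allelic SNP's alleles relate to the base in the target reference."""
    ref, alt, new = old_ref.upper(), old_alt.upper(), new_ref_base.upper()
    if new == ref:
        return "match"             # reference allele unchanged between builds
    if new == alt:
        return "swapped_ref_alt"   # target reference carries the old ALT allele
    return "neither_allele"        # target base matches neither old allele

# rs3819995 from the post: C>T in hg19, while hg38 has T as the reference base.
print(classify_ref_mismatch("C", "T", "T"))  # prints swapped_ref_alt
```

The rs3819995 example falls into the swapped category: hg38 carries the allele that hg19 reported as the alternate, so the rsID can stay the same while REF and ALT trade places. Picard's LiftoverVcf documents a RECOVER_SWAPPED_REF_ALT option for bi-allelic SNPs of this kind. It is worth checking whether your Picard version has it and how it treats annotations such as AC and AF before relying on it.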
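
The workaround described earlier in the thread (rerun the liftover, drop the records that trigger the exception, repeat) can be sketched as a bisection over the records. This is not David's actual script, and bisection is a swapped-in technique: `find_failing_records` and the `lifts_ok` callback are hypothetical names. In practice `lifts_ok` would write the given records plus the VCF header to a temporary file, run the LiftoverVcf command from the original post, and report whether it exited cleanly. The sketch assumes each bad record fails on its own, independent of its neighbours.

```python
def find_failing_records(records, lifts_ok):
    """Return the records that make lifts_ok() fail.

    records:  list of VCF data lines (the header is handled by the caller)
    lifts_ok: callable taking a list of records, True if liftover succeeds on them
    """
    if not records or lifts_ok(records):
        return []                       # nothing in this chunk triggers the error
    if len(records) == 1:
        return list(records)            # isolated a single failing record
    mid = len(records) // 2
    return (find_failing_records(records[:mid], lifts_ok)
            + find_failing_records(records[mid:], lifts_ok))

# Toy stand-in for a real liftover run: pretend records tagged "bad" crash the tool.
def fake_lifts_ok(chunk):
    return not any(line.endswith("bad") for line in chunk)

lines = ["rec1 ok", "rec2 bad", "rec3 ok", "rec4 bad"]
bad = find_failing_records(lines, fake_lifts_ok)
good = [line for line in lines if line not in bad]
print(bad)   # prints ['rec2 bad', 'rec4 bad']
print(good)  # prints ['rec1 ok', 'rec3 ok']
```

The records in `good` can then be written back out with the header and lifted over in one pass, while `bad` gives a short list to attach to a bug report.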