[LeftAlignAndTrimVariants]

Hi,

I have a set of variants (below) that can be left aligned, but GATK does not left align them:

20:1348896:CTGGAATATGACTC,C,CTGGAATACGACTC
20:3321991:CA,A
20:8257660:ATTG,G
20:8712622:ATA,A,TTA
20:10880228:TCT,T,TTT
20:11349916:ATA,ACA,A
20:11891803:AG,G
20:14195865:TATTT,TCTAT,T,TATAT
20:17458895:TAAT,TAT,T
20:21556925:CTTC,CCTC,C
20:22452520:AGA,ATA,A
20:23435839:TGT,T,AGT
20:24412174:CAGAC,CAGCC,C
20:25975887:AGTAA,A,AGAAA
20:30405420:TTCT,T,TTTT
20:31011363:TATGT,TATTT,T
20:31150901:GAGGGTG,G,GAGGCTG
20:33972794:AAAGA,AAAAA,A,AAAA
20:34548750:CTTCTC,C,CTTCCC
20:35137684:AAACA,A,AAATA
20:35209242:ACA,AAA,AA,A
20:35486864:AAAACACACACA,AA,AAA,AACACACACACA,A
20:36217443:TGT,T,TTT
20:37219146:CACTC,C,CACAC
20:37219156:CTC,C,CAC
20:37660012:TGT,TCT,T
20:39588205:AAAGAA,AAAGAAGGAGGGAAGGAAGGAAGGA,A
20:40518952:TGTTCT,TGTTTT,T
20:41512412:CCTC,CCC,C
20:43896621:AAGAAAGAG,G
20:45900810:GAG,GG,G
20:46077766:TT,T,TTTTAT
20:46540182:TT,CT,T
20:47226805:CTTTC,C,CTTCC
20:53097535:ATCTATCA,ATCTATCTA,ATCTA,A,ATCAATCA,CTCTATCTA
20:54194204:AGA,ACA,A
20:54904549:AAGA,AAAA,AAA,AA,A
20:56069216:TTTGTGT,TT,TGTGTGT,T
20:56125852:AGA,AAA,A
20:56998532:CATC,AATC,C
20:57308854:TAC,C
20:57651565:TGT,TTT,TT,T
20:58144616:TGGATGGACGGAT,T,TGGATGGATGGAT
20:58454940:ATAAA,ATATA,A
20:58523103:AGA,AAA,AA,AAAA,A,GAA
20:58919551:TT,CT,T
20:58919555:TT,TTTCT,T
20:60137192:GTG,CTG,G,GGG
20:60343357:GTAG,G,ATAG
20:60414491:GTTTG,GTG,G
20:60418564:ATA,A,AGTGAGACA
20:60712179:GTAGCAG,G,GCAGCAG
20:61180608:TCTAT,TCCAT,T
20:61408578:AGGA,GGGA,A,AGAA
20:61621530:CGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCTGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCT,T
20:62045114:ACA,A,GCA
20:62074175:CATC,CACC,C

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    How exactly do you think these should be aligned?

  • atks (Member)
    edited May 2014

    Each => points to the way the record should be left aligned.

    20:1348896:CTGGAATATGACTC,C,CTGGAATACGACTC => 20:1348895:CCTGGAATATGACT,C,CCTGGAATACGACT
    20:3321991:CA,A => 20:3321988:TC,T
    20:8257660:ATTG,G => 20:8257633:CTTA,C
    20:8712622:ATA,A,TTA => 20:8712621:CAT,C,CTT
    20:10880228:TCT,T,TTT => 20:10880227:TTC,T,TTT
    20:11349916:ATA,ACA,A => 20:11349915:CAT,CAC,C
    20:11891803:AG,G => 20:11891793:GA,G
    20:14195865:TATTT,TCTAT,T,TATAT => 20:14195864:ATATT,ATCTA,A,ATATA
    20:17458895:TAAT,TAT,T => 20:17458894:TTAA,TTA,T
    20:21556925:CTTC,CCTC,C => 20:21556924:CCTT,CCCT,C
    20:22452520:AGA,ATA,A => 20:22452519:CAG,CAT,C
    20:23435839:TGT,T,AGT => 20:23435838:TTG,T,TAG
    20:24412174:CAGAC,CAGCC,C => 20:24412173:CCAGA,CCAGC,C
    20:25975887:AGTAA,A,AGAAA => 20:25975886:CAGTA,C,CAGAA
    20:30405420:TTCT,T,TTTT => 20:30405419:CTTC,C,CTTT
    20:31011363:TATGT,TATTT,T => 20:31011362:TTATG,TTATT,T
    20:31150901:GAGGGTG,G,GAGGCTG => 20:31150900:GGAGGGT,G,GGAGGCT
    20:33972794:AAAGA,AAAAA,A,AAAA => 20:33972793:AAAAG,AAAAA,A,AAAA
    20:34548750:CTTCTC,C,CTTCCC => 20:34548749:CCTTCT,C,CCTTCC
    20:35137684:AAACA,A,AAATA => 20:35137683:TAAAC,T,TAAAT
    20:35209242:ACA,AAA,AA,A => 20:35209241:AAC,AAA,AA,A
    20:35486864:AAAACACACACA,AA,AAA,AACACACACACA,A => 20:35486863:GAAAACACACAC,GA,GAA,GAACACACACAC,G
    20:36217443:TGT,T,TTT => 20:36217442:TTG,T,TTT
    20:37219146:CACTC,C,CACAC => 20:37219145:ACACT,A,ACACA
    20:37219156:CTC,C,CAC => 20:37219155:ACT,A,ACA
    20:37660012:TGT,TCT,T => 20:37660011:CTG,CTC,C
    20:39588205:AAAGAA,AAAGAAGGAGGGAAGGAAGGAAGGA,A => 20:39588204:GAAAGA,GAAAGAAGGAGGGAAGGAAGGAAGG,G
    20:40518952:TGTTCT,TGTTTT,T => 20:40518951:TTGTTC,TTGTTT,T
    20:41512412:CCTC,CCC,C => 20:41512411:CCCT,CCC,C
    20:43896621:AAGAAAGAG,G => 20:43896569:GAGAAAGAA,G
    20:45900810:GAG,GG,G => 20:45900809:GGA,GG,G
    20:46077766:TT,T,TTTTAT => 20:46077765:TT,T,TTTTTA
    20:46540182:TT,CT,T => 20:46540181:CT,CC,C
    20:47226805:CTTTC,C,CTTCC => 20:47226804:CCTTT,C,CCTTC
    20:53097535:ATCTATCA,ATCTATCTA,ATCTA,A,ATCAATCA,CTCTATCTA => 20:53097534:TATCTATC,TATCTATCT,TATCT,T,TATCAATC,TCTCTATCT
    20:54194204:AGA,ACA,A => 20:54194203:CAG,CAC,C
    20:54904549:AAGA,AAAA,AAA,AA,A => 20:54904548:AAAG,AAAA,AAA,AA,A
    20:56069216:TTTGTGT,TT,TGTGTGT,T => 20:56069215:ATTTGTG,AT,ATGTGTG,A
    20:56125852:AGA,AAA,A => 20:56125851:AAG,AAA,A
    20:56998532:CATC,AATC,C => 20:56998531:ACAT,AAAT,A
    20:57308854:TAC,C => 20:57308836:CAT,C
    20:57651565:TGT,TTT,TT,T => 20:57651564:TTG,TTT,TT,T
    20:58144616:TGGATGGACGGAT,T,TGGATGGATGGAT => 20:58144615:GTGGATGGACGGA,G,GTGGATGGATGGA
    20:58454940:ATAAA,ATATA,A => 20:58454939:TATAA,TATAT,T
    20:58523103:AGA,AAA,AA,AAAA,A,GAA => 20:58523102:AAG,AAA,AA,AAAA,A,AGA
    20:58919551:TT,CT,T => 20:58919550:CT,CC,C
    20:58919555:TT,TTTCT,T => 20:58919554:CT,CTTTC,C
    20:60137192:GTG,CTG,G,GGG => 20:60137191:GGT,GCT,G,GGG
    20:60343357:GTAG,G,ATAG => 20:60343356:TGTA,T,TATA
    20:60414491:GTTTG,GTG,G => 20:60414489:GTGTT,GTG,G
    20:60418564:ATA,A,AGTGAGACA => 20:60418563:CAT,C,CAGTGAGAC
    20:60712179:GTAGCAG,G,GCAGCAG => 20:60712174:CAGCAGT,C,CAGCAGC
    20:61180608:TCTAT,TCCAT,T => 20:61180607:TTCTA,TTCCA,T
    20:61408578:AGGA,GGGA,A,AGAA => 20:61408577:AAGG,AGGG,A,AAGA
    20:61621530:CGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCTGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCT,T => 20:61621491:CGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCCGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCT,C
    20:62045114:ACA,A,GCA => 20:62045113:GAC,G,GGC
    20:62074175:CATC,CACC,C => 20:62074174:CCAT,CCAC,C
    
  • rpoplin (Member ✭✭✭)

    I don't quite understand your notation. Are you showing ref allele, alt allele, alt allele?

    Thanks,

  • atks (Member)

    Yes, it is. Sorry about the confusion.

  • rpoplin (Member ✭✭✭)

    I see. The LeftAlignAndTrimVariants walker doesn't currently work with multiallelic mixed records like these.

    There is a note in the walker's documentation: "Note that this tool cannot handle anything other than bi-allelic, simple indels."

    Take a look at the splitMultiallelics argument to see if that will help get towards what you want here.

    Cheers,
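
As a rough illustration of the documentation note quoted above, the sketch below checks whether a record looks like a bi-allelic, simple indel. The function name and the exact rule (exactly one ALT allele, and the shorter allele is a single padding base that the longer allele starts with) are assumptions made for this example; this is not GATK's actual check.

```python
def is_biallelic_simple_indel(ref, alts):
    """Heuristic pre-check (an assumption for illustration, not GATK's code)."""
    if len(alts) != 1:
        return False  # multi-allelic record
    alt = alts[0]
    if len(ref) == len(alt):
        return False  # SNP or MNP, not an indel
    shorter, longer = (ref, alt) if len(ref) < len(alt) else (alt, ref)
    # A simple indel keeps exactly one padding base, shared by both alleles.
    return len(shorter) == 1 and longer[0] == shorter


# Records taken from the list in this thread (REF, then ALT alleles).
print(is_biallelic_simple_indel("CA", ["A"]))          # as posted
print(is_biallelic_simple_indel("TC", ["T"]))          # proposed left-aligned form
print(is_biallelic_simple_indel("TCT", ["T", "TTT"]))  # multi-allelic
```

Under this heuristic, the posted form CA,A is not a simple indel, while the proposed form TC,T is.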

  • atks (Member)

    There are also 6 biallelic records that did not work.

    20:3321991:CA,A => 20:3321988:TC,T
    20:8257660:ATTG,G => 20:8257633:CTTA,C
    20:11891803:AG,G => 20:11891793:GA,G
    20:43896621:AAGAAAGAG,G => 20:43896569:GAGAAAGAA,G
    20:57308854:TAC,C => 20:57308836:CAT,C
    20:61621530:CGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCTGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCT,T => 20:61621491:CGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCCGGGGTGGCCCGGCTGGCATTGCCTTCTCCTAACGTTCCT,C

  • rpoplin (Member ✭✭✭)

    The trouble with those records is that they aren't valid VCF records. The deletions should have padding bases. Where did the input file come from?

  • atks (Member)

    It's from the Boston College call set for 1000 Genomes.

    I think the variant is fine. When you refer to the padding base, do you mean that the first base for all alleles should be the same? This is a case of a deletion with an adjacent SNP that cosegregates with the deletion allele. I also just checked VCFv4.2: the ALT field does not require that the padding be the same.

  • ebanks (Broad Institute; Member, Broadie, Dev ✭✭✭✭)

    I agree that the records are valid. But as @rpoplin mentioned, this tool does not try to fix multi-allelic or mixed variants. The records you posted are all either multi-allelic or mixed (which is defined as a mixture of types, as in the deletion with an adjacent SNP).

  • atks (Member)

    OK, sure. Thank you. It might be a good idea to get it to fix multiallelics and mixed variants.

  • atks (Member)

    Another thing to note, although the 6 variants look like mixed variants, after left alignment, it is clear that they are actually simple indels. The purpose of raising these issues is to try and standardize the notion of left alignment/normalization.
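
To make the notion of left alignment/normalization concrete, here is a minimal sketch of a trim-and-extend procedure: drop a shared trailing base, pad on the left from the reference when an allele becomes empty, then trim shared leading bases. The function name and the toy reference strings are assumptions for illustration (they are not chromosome 20), and this is not the code of GATK, vt or bcftools.

```python
def normalize(ref_seq, pos, ref, alts):
    """Left-align and trim one record against ref_seq (pos is 1-based)."""
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF does not match the reference sequence")
    alleles = [ref] + list(alts)
    changed = True
    while changed:
        changed = False
        # Drop the last base if every allele is non-empty and ends with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An empty allele needs a padding base: extend every allele one base to the left.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
    # Finally drop shared leading bases, keeping at least one base per allele.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


# Toy reference with a run of C bases, mimicking the CA,A record above.
print(normalize("GGTCCCAG", 6, "CA", ["A"]))
# Toy reference for a multi-allelic record shaped like TCT,T,TTT.
print(normalize("ATTCTG", 3, "TCT", ["T", "TTT"]))
```

On these toy inputs the first record moves three bases to the left and becomes TC,T, a simple deletion; the second moves one base and becomes TTC,T,TTT. Both mirror the shape of the corresponding rewrites proposed earlier in the thread.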

  • tommycarstensen (United Kingdom; Member ✭✭✭)

    Does a GATK tool exist for converting INDEL calls from other callers into the same format as that of GATK? Here is an example, where samtools and FreeBayes get it "wrong", whereas GATK "correctly" trims the REF:

    #CHROM  POS ID  REF ALT QUAL    FILTER  NA12878
    

    GATK HaplotypeCaller 3.2-2:

    20  10296272    .   CA  C   93908.20    .   1/1:0,4:.:12:114,12,0
    

    Real Time Genomics:

    20  10296272    .   CA  C   21889.3 PASS    1|1:35:0.695:0.077:0.000:1304.5:207:2.23:6.79:0.24:0.00:0.00:1,32:-130.45,-21.20,0.00:0.4640
    

    samtools0.x:

    20  10296272    .   CAAAAAAAAAA CAAAAAAAAA  999 1/1:35,12,0:4:4
    

    FreeBayes v9.9.2:

    20  10296272    .   CAAAAAAAAAAG    CAAAAAAAAAG 103.849 .   1/1:3:0:0:3:98:-10,-1.10663,0
    

    I will ask for the GATK format to be the only correct format in VCFv4.3. I will notify the samtools and FreeBayes developers. Thanks.
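
The difference between the representations above comes down to shared trailing bases. A minimal sketch of that trimming step, assuming a bi-allelic record; the function name is hypothetical and this is not how any of the callers implement it:

```python
def trim_shared_suffix(ref, alt):
    """Remove bases shared at the right end, keeping at least one base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


# REF/ALT pairs as reported by samtools and FreeBayes in the post above.
print(trim_shared_suffix("CAAAAAAAAAA", "CAAAAAAAAA"))
print(trim_shared_suffix("CAAAAAAAAAAG", "CAAAAAAAAAG"))
```

Both pairs should reduce to the REF=CA, ALT=C form that GATK and Real Time Genomics report at the same position. Trimming alone does not move the variant to the left; that needs the reference sequence.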

  • tommycarstensen (United Kingdom; Member ✭✭✭)

    Erik Garrison was kind enough to point me to vcfallelicprimitives (https://github.com/ekg/vcflib) and vt normalize (https://github.com/atks/vt). Please ignore my question.

  • atks (Member)
    edited September 2014

    GATK's LeftAlignAndTrimVariants will work if you include the --trimAlleles option. It will, however, not properly left align multiallelics and some simple biallelic indels.

    bcftools norm implements the same algorithm as vt normalize, so you have another choice of tool there.

    vcfallelicprimitives decomposes a complex variant or MNP into its constituent SNP/Indels.

    vt decompose splits multiallelic variants into biallelic records.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    In a similar vein, we also have VariantsToAllelicPrimitives

  • tommycarstensen (United Kingdom; Member ✭✭✭)

    Thanks @atks and @Geraldine_VdAuwera for additional info! Using -T LeftAlignAndTrimVariants --trimAlleles does indeed work. I went for vcfallelicprimitives, which does the trick and allows me to pipe the output and the input. I was also afraid LeftAlignAndTrimVariants couldn't handle long INDELs, as other walkers (ValidateVariants) complain about long reference alleles.

    Dear GATK developers. Walkers allowing stdin input is at the top of my wish list and my birthday is end of November. Yeah, someone will be busy until then :) I know I'm not the only one:
    http://gatkforums.broadinstitute.org/discussion/3450/does-gatk-support-stdin

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    To be honest this is unlikely to happen in the "classic" java-based GATK, but I wouldn't be surprised if it was possible in the new C++ based GATK expansion that is currently under development and may be released in time for your birthday...

  • tommycarstensen (United Kingdom; Member ✭✭✭)

    @atks said:
    GATK's LeftAlignAndTrimVariants will work if you include the --trimAlleles option. It will, however, not properly left align multiallelics and some simple biallelic indels.

    bcftools norm implements the same algorithm as vt normalize, so you have another choice of tool there.

    vcfallelicprimitives decomposes a complex variant or MNP into its constituent SNP/Indels.

    vt decompose splits multiallelic variants into biallelic records.

    I had some problems with a few GT fields being changed by LeftAlignAndTrimVariants --splitMultiallelics (from known to unknown) and vcfallelicprimitives (from unknown to known).

    Only sites are output by vt decompose.

    Multiallelic sites are not split by vt normalize.

    I'm currently attempting to write my own code, but it quickly gets quite complicated for multiallelic sites, when for example one sample has GT 0/1 and another 1/2 at the same position.

  • atks (Member)

    I think LeftAlignAndTrimVariants is doing it correctly in that some GT fields are moved from known to unknown because, in some cases, the allele of interest is no longer represented in the ALT field.

    vt decompose does not support genotype field output at this time partly because of the above reason.

  • tommycarstensen (United Kingdom; Member ✭✭✭)

    Thanks for your continuous comments @atks.

    I had a case of LAATV --splitMultiAllelics changing a REF=G, ALT=T,A, GT=0/1 call to REF=G, ALT=T, GT=./. and REF=G, ALT=A, GT=./.:
    http://gatkforums.broadinstitute.org/discussion/4646/leftalignandtrimvariants-splitmultiallelics-changes-gt-from-known-to-unknown

    I believe it should have been REF=G, ALT=T, GT=0/1 for the first of the two biallelic variants after splitting the multiallelic variant.
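
The genotype bookkeeping discussed here can be sketched as follows. This is one possible convention, assumed for illustration only: in each split record, the reference stays 0, the retained ALT allele becomes 1, and any other ALT allele becomes unknown ("."). The function name is hypothetical, and GATK, vt and vcflib may each handle this differently.

```python
def split_multiallelic(ref, alts, gt):
    """Split one multi-allelic call into bi-allelic (ref, alt, gt) tuples."""
    sep = "|" if "|" in gt else "/"
    calls = gt.split(sep)
    records = []
    for index, alt in enumerate(alts, start=1):
        new_calls = []
        for call in calls:
            if call == ".":
                new_calls.append(".")  # already unknown
            elif int(call) == 0:
                new_calls.append("0")  # reference allele is kept as-is
            elif int(call) == index:
                new_calls.append("1")  # the ALT allele kept in this record
            else:
                new_calls.append(".")  # allele no longer present in ALT
        records.append((ref, alt, sep.join(new_calls)))
    return records


# The REF=G, ALT=T,A case from this post, and the harder 1/2 case.
print(split_multiallelic("G", ["T", "A"], "0/1"))
print(split_multiallelic("G", ["T", "A"], "1/2"))
```

With this convention the 0/1 call stays 0/1 in the G>T record and becomes 0/. in the G>A record, while a 1/2 call becomes 1/. and ./1.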

  • atks (Member)

    I agree that it should be 0/1 for the first of the split alleles.

    I also need this feature to refine some analyses, so you can find the description of the updated version of decompose at http://genome.sph.umich.edu/wiki/Vt#Decompose. If there are any issues, please raise them on the GitHub issues page: https://github.com/atks/vt/issues

  • Eva (Member)

    Hi atks,
    vt decompose is giving me a Segmentation fault (core dumped) error. Can you please assist? I am invoking it the same way as is given at http://genome.sph.umich.edu/wiki/Vt#Decompose. Thank you.

  • Eva (Member)

    decompose was giving this message: "[W::vcf_parse] contig 'gi|602625715|gb|AE004092.2|' is not defined in the header. (Quick workaround: index the file with tabix.)" As atks suggested, compressing the file with bgzip and indexing it with tabix corrected the error, in case anybody faces the same problem in the future. Thank you, atks, for your help.

  • tinu (Member)

    java -Xmx4G -jar GATK_3.4/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hs37d5.fa --variant minimal.vcf -o minimal.LeftAlignAndSplit.vcf --splitMultiallelics --dontTrimAlleles
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.4-0-g7e26428):
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Argument with name 'dontTrimAlleles' isn't defined.
    ##### ERROR ------------------------------------------------------------------------------------------
    

    Why does this error occur?
    I am using GATK 3.4

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    That argument only became available in version 3.5 -- which was released today.

  • tinu (Member)

    Thank you. Sure, I will try with that.

  • tinu (Member)

    Hi Geraldine,

    I tried with GATK 3.5. I think it is still not working in version 3.5.

    java -Xmx4G -jar GATK_3.5/GenomeAnalysisTK.jar -R hs37d5.fa -T LeftAlignAndTrimVariants --variant  INPUT.vcf  -o INPUT.gatkLADT.vcf --dontTrimAlleles --splitMultiallelics
    

    Issue-1
    Variant in INPUT.vcf:
    11 108139106 rs4987971 CTTAGTG TTTAGTG,C,GTTAGTG

    Variant in INPUT.gatkLADT.vcf:
    11 108139106 rs4987971 C T
    11 108139106 rs4987971 CTTAGTG C
    11 108139106 rs4987971 C G

    Issue-2
    Also, the output VCF doesn't have AC, AN and AF values, and genotypes are not retained.

    Variant in INPUT.vcf:
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1110

    2 48030645 rs63750998 C T,G 8307.65 . AC=0,1;AF=0.00,2.040e-04;AN=4902;B GT:AD:DP:GQ:PL 0/2:125,0,97:222:99:2605,2980,6790,0,3810,3519

    Variant in INPUT.gatkLADT.vcf:
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1110

    2 48030645 rs63750998 C G 8307.65 . . GT:DP ./.:222
    2 48030645 rs63750998 C T 8307.65 . . GT:DP ./.:222

    Thanks,
    Tinu

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    That is the expected behavior.
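
If AC, AN and AF are needed after splitting, one option is to recompute them from the genotypes. The sketch below is a minimal illustration with a hypothetical function name; it assumes plain GT strings such as 0/1 or 1|2, and it is not part of GATK.

```python
def allele_counts(genotypes, n_alts):
    """Return (AC per ALT allele, AN, AF per ALT allele) from GT strings."""
    ac = [0] * n_alts
    an = 0
    for gt in genotypes:
        for call in gt.replace("|", "/").split("/"):
            if call == ".":
                continue  # unknown alleles are not counted in AN
            an += 1
            allele = int(call)
            if allele > 0:
                ac[allele - 1] += 1
    # AF is undefined when no alleles were called.
    af = [count / an if an else None for count in ac]
    return ac, an, af


# Made-up genotypes for a site with two ALT alleles.
print(allele_counts(["0/2", "0/0", "./."], 2))
```

Genotypes that were set to ./. contribute nothing, so a site where every genotype became unknown ends up with AN=0 and no usable AF.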
