GATK VQSR on indels

Hi

Are there any precautions to be taken before applying VQSR to an "INDEL" VCF?
meaning, how does the algorithm inside VQSR recognize whether an indel in the input VCF is the
same as in the training dataset?
Is there some re-alignment, or equivalent representation to be done on either the input VCF,
training VCF or both prior to running the VCF through the VQSR command line?

Thanks
Severine

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Severine, at present the evaluation is done solely on the site coordinates. In future we'll have some allele-based functionality.

  • severinecseverinec Member

    Hi Geraldine
    thanks.
    Hmm. that means that 2 "equivalent" indels may not be necessarily recognised as such by the VQSR algorithm.
    Any pre-processing steps you could suggest I try running, on both the input and training VCF, to alleviate this issue?

    Severine

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    By convention GATK outputs indels left-aligned, and all our resource files are left-aligned, so as long as you keep it in the GATK family, so to speak, the risk of that being a problem is negligible. If you're using callsets generated elsewhere then you may benefit from left-aligning them to be sure they follow the same conventions. There's a utility tool in GATK to do this, check the variant manipulation tools list.

  • severinecseverinec Member

    because currently, without any pre-processing steps, the indel ROC looks much worse after VQSR than before VQSR.
    Meanwhile, with the SNP ROC, it is the opposite: the SNP ROC is clearly improved after VQSR compared to before
    (which is expected).
    So, the question is, is it expected to see degradation in the INDEL ROC after VQSR, compared to before VQSR?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    In general, no. How are you calculating the indel ROC? What kind of dataset (WGS or Ex, # of samples) are you working with?

  • severinecseverinec Member

    ok.

    I am using VCF out of GATK-HC, and the resource files are from GATK bundle, so it seems it should all be left-aligned
    consistent then.

    My test is using a single sample whole genome run...

    Severine

  • severinecseverinec Member

    the ROC is obtained via RTG vcfeval, and the sample is NA12878 so the comparison is done against the NIST
    high confidence call set.
    So everything is pretty vanilla. not sure where the problem is.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I see, the thing is, a single whole genome is the bare minimum for VQSR to build its model. And for indels that tends to be too little data for the model to be really robust. You should expect to see better results with larger sample numbers.

  • severinecseverinec Member

    ok, i will rerun by adding indels from other samples, under the --aggregate option. will let you know if that helps.
    thanks very much for your quick answers.
    S.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Severine, we don't recommend using the aggregate option for this. The right way to do it is to add samples at the joint calling (GenotypeGVCFs) stage.

  • severinecseverinec Member

    ok, yes, I just tried adding one additional WGS indels in the --aggregate, but the ROC is still bad.

    I can try using the joint genotyper. In this case, should I use a trio WGS joint VCF file as input to VQSR?

  • severinecseverinec Member

    BTW, Geraldine, tu parles francais?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, you can use a trio joint WGS as input to VQSR. That should work well. If you still have trouble with indels you can reduce the number of Gaussian clusters that the tool will try to find for the indel case. See the VQSR method article for troubleshooting recommendations.

    Oui je parle français -- je suis belge francophone. Et toi?

  • severinecseverinec Member

    je vais te contacter par email.

    I ran the trio joint WGS, the indel ROC is much improved,but still not quite as good as prefilter.
    I will rerun with maxGaussian set to 4. and I may try with 5WGS, to see if i continue seeing improvements.

    how about this parameter: badLodCutoff, should i try tuning this too?
    another thing I observed, now that I have switched to a joint VCF is that the SNP ROC looks like a "double ROC",
    i.e., it has a knee and saturates, then starts a new knee and saturates again.
    I wonder what this is due to. Have you guys seen this?

    Merci!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I don't remember ever seeing two knees in a ROC curve, that sounds weird.

    Can you post the figures and detail exactly what is the grounds used for calculating the curve?

  • severinecseverinec Member

    it does sounds weird. Let me investigate further to see if it's something due to our own post-processing script.
    If i still don't find anything, I'll get back to you.

    BTW, I also notice some variant records which are missing the VQSLOD score. is that expected?
    those without a score get ignored from the ROC

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @severinec
    Hi,

    Can you post some example records that do not have the VQSLOD score?

    Thanks,
    Sheila

  • severinecseverinec Member

    Hi Sheila

    I isolated the root cause of the issue. I was trying to isolate the SNP and INDEL from the input VCF to process them
    separately, but I found that it's better to just use the original input VCF in both the SNP and INDEL VQSR command lines.
    When I don't split, then i don't miss any VQSLOD score.

    Thanks!
    Severine

  • severinecseverinec Member

    Hi Geraldine
    I have done several VQSR tests and here is the status on the 2 issues I am tracking:

    1) Post VQSR INDEL ROC performance,versus prefilter ROC
    I finally got a better ROC but only after i did 2 things: add more samples (3 WGS was good) AND upgrade the GATK version.
    3.1 was not sufficient to get a better ROC, but I got it with 3.5.

    2) the double-knee in the SNP ROC
    this one is still a mystery. the prefilter ROC does not have such behavior, but the post VQSR ROC one does.
    I see it with both versions 3.1 and 3.5.
    so I would like to understand this better.
    what do you need me to upload to debug further?

    • the input joint VCF (with 3 WGS)?
    • the VQSR command line?
    • the post VQSR joint VCF (with VQSLOD)?
    • the ROC plot (attached)
      how can i upload files to you guys?

    Thanks
    Severine

    Issue · Github
    by Sheila

    Issue Number
    865
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • severinecseverinec Member

    Hi Sheila

    how can i upload the files please?
    Thanks
    Severine

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @severinec
    Hi Severine,

    Instructions are here.

    Thanks,
    Sheila

  • severinecseverinec Member

    Sheila/Geraldine

    Thanks for the instructions. I uploaded the following tar file to the ftp at Broad:
    Severine_double_knee/ftpgatk.tar.gz

    it contains a ReadMe.txt, the joint VCF (with 3 WGS) pre and post VQSR filtering,
    all command lines in a text file, and the plot of the SNP ROC as produced by RTG vcfeval.

    I would like to understand where the double-knee in the SNP ROC comes from, and
    what settings I need to tune, to make the curve look like the same shape as the prefilter one.

    Many thanks!!
    Severine

  • severinecseverinec Member

    Hi

    We have found that the "double-knee" behavior in the SNP ROC post VQSR is due to the fact that VQSLOD
    is non-continuously distributed. When plotting the histogram, it shows distinct peaks of values, but in between these peaks
    there are gaps. This means that there are some ranges of values that VQSLOD does not take at all.
    Do you guys know of any settings that would make the VQSLOD distribution more evenly distributed, i.e., smoother?
    do we need to reduce the number of Gaussians?

    Thanks
    Severine

    Issue · Github
    by Sheila

    Issue Number
    893
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @severinec
    Hi Severine,

    It seems vcfeval tool uses different annotation values to create the ROC curves. What annotation did you use to make the "before filtering" ROC curve?
    And, I am assuming the "after filtering" plot is based on the VQSLOD value?

    Thanks,
    Sheila

  • severinecseverinec Member

    "before" filtering, i used QUAL. after filtering, yes, I used VQSLOD.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @severinec, this gave my team a lot of food for thought, so thank you for bringing this up. You're correct that the VQSLOD scores are not continuously distributed. This is an expected side effect of how they are generated; because variants that are close to a particular Gaussian clusters will tend to have very similar VQSLOD scores, whereas variants inhabiting the space between Gaussians will tend to have more unique VQSLOD scores. Decreasing the number of Gaussians would probably accentuate this effect. Anyway this is a property of the model, so trying to make the curve smooth just for the sake of it would not be appropriate. I had not recognized this initially because we don't normally plot VQSLOD scores; our typical approach is to look at tranches rather than the VQSLOD scores themselves (where the tranches are intended to capture ranges of VQSLODs based on sensitivity to the truthset).

    As far as I'm concerned everything makes sense now and you should be all set, but let us know if there is anything still bugging you about this.

  • zejunyanzejunyan EdinburghMember

    @Geraldine_VdAuwera

    Hi, this is yan from Edinburgh. I would like to thank you again for your workshop at Edinburgh back in July this year.

    I am doing hard-filtering on my own data. What I have done so far is:

    1. I managed to combine my SNP callset with the callset from SNP database

    2. I managed to plot density graphs using the R script you provide

    3. Now, I am testing my hard-filtered callsets using rtg vcfeval. However, I need refSDF for this evaluation. I am working on chicken genome, do you know where I can get SDF for chicken reference genome? By the way, what is SDF? Do I really need it for this step?

    Bests

    Dr yan

  • zejunyanzejunyan EdinburghMember

    @Geraldine_VdAuwera

    I used combined unfiltered variants to plot density plots for various annotations with the aim of determining a filtering threshold for each annotation. However, in all the plots I have plotted, there is only variants from gatk shown, is this expected? I do have 'intersections' under set between my gatk callsets and the SNP database, but the intersection is not shown on the graph.

    I would like to show the plots, but it did allow me to upload the files.

    Thank you for your time

    Dr Yan

  • zejunyanzejunyan EdinburghMember

    @Geraldine_VdAuwera

    It seems that I created the SDF of chicken genome using rtg format. I hope that you would do the same to get SDF of chicken genome.
    Now, I can execute rtg vcfeval. But I do not understand what is the input for --sample option. Basically, i do not know this explanation from (rtg vcfeval -h): the name of the sample to select. Use , to select different sample names for baseline and calls. I my case, my GGVCF joined 11 samples.

    Bests

    Dr yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited August 2017

    @zejunyan
    Hi Dr. Yan,

    We cannot help you find an SDF reference for chicken. We only provide it for human in the tutorial. SDF is a type of file format that is optimized for large data. You do need the SDF format for running RTG tools.

    You will only see the variants from GATK, because the sites not called by GATK may not have the annotations you are trying to plot.

    I think you will need to choose one sample at a time to run on. In your case, you will have to run 11 times for the the 11 different samples.

    -Sheila

    EDIT: Glad to see you were able to make an SDF for chicken reference :smile:

  • zejunyanzejunyan EdinburghMember

    @sheilao
    Thank you for your reply.
    I realised that my SNP database (downloaded from ensemble and NCBI) is not annotated as GATK does i.e. snps in these two databases do not have annotations for the filters I want tp use for hard filtering. So, this does not allow me to make the density plots that have both gatk-called snps and database snps plotted for each filter so that I can get a good threshold value.

    My question is: is it still possible to get a good threshold value from the density plot that only has gatk-snps plotted? Or do you have other suggestions on how to filter out bad 'snps'?

    Bests

    Dr Yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zejunyan
    Hi Dr. Yan,

    Yes, you can get a good threshold using just the GATK called variants. In fact, that is all you need. In your case, you would like to filter out "bad variants" from your GATK callset. The variants that are not in the GATK callset, but are in the truth dataset do not need to be filtered out. So, it does not matter that they are not in your plots. Does this make sense?

    For determining thresholds, have a look at the GATK filtering tutorial here and the Methods and Algorithms article here.

    -Sheila

  • zejunyanzejunyan EdinburghMember

    @Sheila

    Thank you for your suggestion, which is very helpful.

    Because I do not want to lose my chicken-line specific SNPs, I need to keep those SNPs that are not intersection with SNPs in database from Ensemble. So, I just do hard filtering on SNPs that are not intersection with database SNPs. How do you think about this idea?

    Because I called and genotyped SNPs on a trio family (11 chickens), what I found was that not all SNPs were fully genotyped. By this I mean, some genotyped SNPs have QD values AS NA, other SNPs have MQ values as NA. So, do you think it is a good ideas that I should remove SNPs that are not fully genotyped first? i.e. SNPs have missing values (NA) in some of their annotation fields. Then, I work on SNPs that are fully genotyped?

    After hard filtering on those SNPs, i can merge these hard-filtered SNPs with SNPs that ate intersection with database SNPs.

    Another question is:

    when I did CalculateGenotypePosterior
    [java -jar ../../tools/GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R ../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -ped trio.ped -V IntersectionSNP.vcf -o recalibratedVariants.postCGP.vcf]

    do I need to supply the -supporting? If so, what should I supply here? I execute this command line without --supporting field, is this right?

    Cheers

    Dr yan

  • zejunyanzejunyan EdinburghMember

    @Sheila

    My final question is:

    When I tried to hard-filter with MQ annotation, it gave me this error:

    NFO 17:22:24,231 HelpFormatter - ---------------------------------------------------------------------------------------------
    INFO 17:22:24,234 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-03-17-ga4f7203, Compiled 2017/03/17 00:01:14
    INFO 17:22:24,234 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 17:22:24,234 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 17:22:24,234 HelpFormatter - [Mon Aug 28 17:22:24 BST 2017] Executing on Linux 4.4.0-83-generic amd64
    INFO 17:22:24,234 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15
    INFO 17:22:24,239 HelpFormatter - Program Args: -T VariantFiltration -R ../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -V ./gatk_1317911_ReadPosRankSum_PASS.vcf --filterExpression MQ < 10.0 --filterName basic_filters -o gatk_1317911_MQ.vcf
    INFO 17:22:24,242 HelpFormatter - Executing as [email protected] on Linux 4.4.0-83-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15.
    INFO 17:22:24,242 HelpFormatter - Date/Time: 2017/08/28 17:22:24
    INFO 17:22:24,243 HelpFormatter - ---------------------------------------------------------------------------------------------
    INFO 17:22:24,243 HelpFormatter - ---------------------------------------------------------------------------------------------
    INFO 17:22:24,287 GenomeAnalysisEngine - Strictness is SILENT
    INFO 17:23:09,376 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 17:24:07,745 GenomeAnalysisEngine - Preparing for traversal
    INFO 17:24:07,783 GenomeAnalysisEngine - Done preparing for traversal
    INFO 17:24:07,784 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 17:24:07,786 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 17:24:07,787 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

    ERROR --
    ERROR stack trace

    java.lang.NumberFormatException: For input string: "inf"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1016)
    at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
    at org.apache.commons.jexl2.JexlArithmetic.lessThan(JexlArithmetic.java:774)
    at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:967)
    at org.apache.commons.jexl2.parser.ASTLTNode.jjtAccept(ASTLTNode.java:18)
    at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
    at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
    at htsjdk.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:178)
    at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:94)
    at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:15)
    at htsjdk.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:338)
    at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.matchesFilter(VariantFiltration.java:483)
    at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.buildVCfilters(VariantFiltration.java:474)
    at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.filter(VariantFiltration.java:379)
    at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:318)
    at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:99)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version nightly-2017-03-17-ga4f7203):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: For input string: "inf"

    What does this mean?

    Cheers

    Dr yan

  • zejunyanzejunyan EdinburghMember

    @Sheila

    I am sorry to bother very much.

    I was trying to filter "lowGQ" after done CalculateGenotypePosterior in the GeneRefinement Workflow, and i got this message:

    Interpreter - ![0,2]: 'GQ < 20;' undefined variable GQ

    So, does this mean it is not filtering my vcf file based on GQ?

    Bests

    Dr yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited August 2017

    @zejunyan
    Hi Dr. Yan,

    For the filtering questions in your first post, it may be best to refer to the filtering tutorial here. That tutorial is an older version, but explains the sets from CombineVariants and how they all relate to each other.

    How did you produce the final trio VCF? Did you follow the GVCF workflow?

    For CalculateGenotypePosteriors, you can input a supporting VCF with known chicken variants if you have them. If you do not have a VCF with known variants, then you can simply run with the ped file and --skipPopulationPriors.

    After you get back on how you produced the VCF, I can help with your second question. It looks like the NA for MQ is causing the error. Are you using the latest version? I don't think you should get the error if you are following Best Practices.

    Interpreter - ![0,2]: 'GQ < 20;' undefined variable GQ simply means that at a particular position, the GQ annotation is not present. The tool skips over those sites that are missing the annotations you request.

    -Sheila

  • zejunyanzejunyan EdinburghMember

    @Sheila

    Hi, Sheila

    To produce the final trio VCF, I followed exactly the GVCF workflow. So, I called variants for each individual in my trio in GVCF mode, I then combined all gVCF (11 of them in my trio), followed by genotyping this combined gVCF. This workflow generates my final trio VCF.

    I am using GATK version 3.7, which should be ok, right?

    Bests

    Yan

  • zejunyanzejunyan EdinburghMember

    @Sheila

    Interpreter - ![0,2]: 'GQ < 20;' undefined variable GQ simply means that at a particular position, the GQ annotation is not present. The tool skips over those sites that are missing the annotations you request.

    If so, do you think it would be a good idea to remove these positions that do not have GQ annotation?

    Bests

    Yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zejunyan
    Hi Yan,

    Hmm. Can you try validating your input VCF with ValidateVariants? If that validates with no errors, can you try using version 3.8 for the VariantFiltration step?

    You do not need to remove the sites with no GQ annotation. There may be other useful annotations at those positions that can be used in filtering.

    -Sheila

  • zejunyanzejunyan EdinburghMember

    @Sheila

    Hi, Sheila

    I am trying to validate my callsets by ValidateVariants as well as evaluate my callsets and VariantEval, but I kept getting the following error:

    Input files /gensys/projects/NGS/zyan/snp_db/ensemble_version/new_version.Gallus_gallus.vcf and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: The contig order in /gensys/projects/NGS/zyan/snp_db/ensemble_version/new_version.Gallus_gallus.vcf and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..

    I followed this instruction to re-order my input file ( new_version.Gallus_gallus.vcf ) using picard SortVcf, the problem is still there.

    The contig order in my input VCF and reference genome:

    BEFORE SORTING:

    **##### ERROR /gensys/projects/NGS/zyan/snp_db/ensemble_version/new_version.Gallus_gallus.vcf contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 3, 30, 31, 32, 33, 4, 5, 6, 7, 8, 9, LGE64, MT, W, Z] (This is my snp database downloaded from ensemble)

    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, W, Z, MT, LGE64, KQ759483.1, KQ759505.1, KQ759404.1, AADN04000541.1, KQ759506.1, KQ759543.1, KQ759405.1, AADN04000593.1, KQ759507.1, AADN04000612.1, KQ759527.1, AADN04000623.1, KQ759508.1, AADN04000624.1, AADN04000628.1, AADN04000631.1, KQ759484.1, AADN04000639.1, AADN04000640.1, KQ759509.1, AADN04000646.1, KQ759528.1, KQ759510.1, AADN04000659.1, AADN04000661.1, AADN04000662.1, AADN04000668.1, KQ759529** [part of contigs]

    AFTER SORTING:

    **comp contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 3, 30, 31, 32, 33, 4, 5, 6, 7, 8, 9, AADN04000541.1, AADN04000547.1, AADN04000550.1, AADN04000593.1, AADN04000609.1, AADN04000612.1, AADN04000621.1, AADN04000623.1, AADN04000624.1, AADN04000628.1, AADN04000631.1, AADN04000639.1, AADN04000640.1, AADN04000643.1, AADN04000646.1, AADN04000659.1, AADN04000660.1, AADN04000661.1, AADN04000662.1, AADN04000664.1, AADN04000668.1, AADN04000674.1, AADN04000677.1, AADN04000682.1, AADN04000703.1, AADN04000704.1, AADN04000705.1, AADN04000706.1, AADN04000708.1, AADN04000710.1, AADN04000711.1, AADN04000712.1, AADN04000717.1, AADN04000725.1, AADN04000729.1, AADN04000731.1, AADN04000732.1, AADN04000735.1, AADN04000736.1, AADN04000737.1, AADN04000743.1, AADN04000744.1, AADN04000745.1, AADN04000750.1, AADN04000751.1, AADN04000752.1, AADN04000753.1, AADN04000754.1, AADN04000767.1, AADN04000772.1, AADN04000776.1, AADN04000777.1, AADN04000780.1, AADN04000785.1, AADN04000787.1, AADN04000788.1, AADN04000789.1, AADN04000790.1, AADN04000791.1, [part of contigs]

    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, W, Z, MT, LGE64, KQ759483.1, KQ759505.1, KQ759404.1, AADN04000541.1, KQ759506.1, KQ759543.1, KQ759405.1, AADN04000593.1, KQ759507.1, AADN04000612.1, KQ759527.1, AADN04000623.1, KQ759508.1, AADN04000624.1, AADN04000628.1, AADN04000631.1, KQ759484.1, AADN04000639.1, AADN04000640.1, KQ759509.1, AADN04000646.1, KQ759528.1, KQ759510.1, AADN04000659.1, AADN04000661.1, AADN04000662.1, AADN04000668.1, KQ759529.1, KQ759530.1, AADN04000677.1, AADN04000682.1, KQ759486.1, KQ759511.1, KQ759531.1, KQ759532.1, KQ759480.1, KQ759534.1, KQ759533.1, AADN04000704.1, AADN04000706.1, AADN04000708.1, AADN04000710.1, AADN04000712.1, AADN04000711.1, AADN04000717.1, KQ759535.1, KQ759536.1, AADN04000729.1, AADN04000732.1, AADN04000736.1, AADN04000737.1, KQ759537.1, AADN04000744.1, AADN04000745.1, KQ759334.1, KQ759538.1, AADN04000750.1, AADN04000751.1, AADN04000752.1, AADN04000753.1, KQ759539.1, AADN04000754.1, KQ759487.1, KQ759540.1, AADN04000767.1, KQ759541.1, KQ759466.1, AADN04000772.1, KQ759365.1, AADN04000776.1, AADN04000777.1, KQ759542.1, AADN04000780.1, AADN04000785.1, AADN04000787.1, AADN04000789.1, AADN04000788.1, AADN04000790.1, AADN04000792.1, KQ759392.1, AADN04000799.1, AADN04000798.1, KQ759335.1, AADN04000801.1, AADN04000802.1, AADN04000803.1, KQ759544.1, KQ759454.1, KQ759545.1, KQ759481.1, AADN04000814.1, AADN04000815.1, KQ759488.1, KQ759546.1, AADN04000827.1, AADN04000828.1, AADN04000829.1, KQ759489.1, AADN04000831.1, AADN04000832.1, AADN04000833.1, AADN04000834.1, AADN04000835.1, AADN04000839.1, KQ759547.1, AADN04000838.1, KQ759548.1, AADN04000843.1, KQ759549.1, AADN04000848.1,**[part of contigs]

    It seems that my input VCF had new contigs added in an ordered manner, but thoese are not in the same order as the reference. In addition, the order of old contigs in my input snp database VCF ([1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 3, 30, 31, 32, 33, 4, 5, 6, 7, 8, 9, LGE64, MT, W, Z]) is still the same after sorting, this order is different from that in reference ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, W, Z, MT, LGE64, .....]

    My callsets were called using this reference genome, so the contig order in my callsets should be the same as the reference, right?

    Thank you so much for your help!!!!

    Bests

    Dr yan

  • zejunyanzejunyan EdinburghMember

    @Sheila
    This is my commandline:

    -T VariantEval -R ../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa -eval ./gatk_1317911.vcf -comp /gensys/projects/NGS/zyan/snp_db/ensemble_version/sorted.new_version.Gallus_gallus.vcf -o ./Unfiltered.VariantEval.txt -noEV -EV ValidationReport

  • zejunyanzejunyan EdinburghMember

    @Sheila

    For SortVcf, my commandline is

    java -jar ../../tools/picard-tools-1.141/picard.jar SortVcf I=/gensys/projects/NGS/zyan/snp_db/ensemble_version/new_version.Gallus_gallus.vcf O=./sorted.new_version.Gallus_gallus.vcf SEQUENCE_DICTIONARY=../../galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.dict

  • zejunyanzejunyan EdinburghMember

    @Sheila

    When using gatk-3.8, it reports such error:

    ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/mnt/Pella/Processing/NGS-working/zyan/tools/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
    ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

    gatk-3.7 does not have this problem.

    Could you tell me what does it mean? I have no idea. :(

    Bests

    Dr yan

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @zejunyan
    Hi Dr. Yan,

    Can you try deleting the VCF index after running SortVcf? I think there was a bug that should be fixed in the latest version (2.11.0).

    For the StatusLogger error, that is a bug that was introduced in 3.8. I think you can ignore the error if your run completes with no error message. But, if it hangs after the error message, you can try with the latest nightly build. Also, have a look at this thread for more information.

    -Sheila

Sign In or Register to comment.