VariantRecalibrator - no data found

tgenahmettgenahmet Posts: 4Member

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    What's your command line? Can you post your full output log?

    Geraldine Van der Auwera, PhD

  • tgenahmettgenahmet Posts: 4Member

    Attached. It contains the command line as well.

    Attachment: GhanaTNBC_0010_1_SA_Whole_C1_KAWGL_J00022-GhanaTNBC_0010_1_BE_Whole_T2_KAWGL_J00023.UG.vcf.varRecalibrator.txt (18K)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Thanks. I see you're running with --mode BOTH, which is unsupported and goes against our recommendations. This may not be the cause of the issue you encountered, but you'll need to try again in SNP or INDEL mode before I can help you.

    Geraldine Van der Auwera, PhD

  • noushin6noushin6 Baltimore, MDPosts: 14Member

    I am getting an identical error message with almost identical command line usage and using --mode SNP. Is there any way I can get some debugging help from you? I am running gatk version 3.1-1-g07a4bf8.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @noushin6, can you please post your command lines?

    Geraldine Van der Auwera, PhD

  • noushin6noushin6 Baltimore, MDPosts: 14Member

    Sure. Here is my commandline:

    java -Xmx${heap}m -jar ${gatk}\
     -T VariantRecalibrator\
     -R ${refSequence}\
     -input ${SCRATCH}/${sample}.raw_variants.vcf\
     -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${trHAPMAP}\
     -resource:omni,known=false,training=true,truth=true,prior=12.0 ${trOMNI}\
     -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${tr1KG}\
     -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${trDBSNP}\
     -an DP\
     -an QD\
     -an FS\
     -an MQRankSum\
     -an ReadPosRankSum\
     -mode SNP\
     -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
     -recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal\
     -tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches\
     -rscriptFile ${SCRATCH}/${sample}.recalibrate_SNP_plots.R
    

    The variables point to corresponding paths, as the line above is a segment from a makefile.

    Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    And what is your ApplyRecalibration command line?

    Geraldine Van der Auwera, PhD

  • noushin6noushin6 Baltimore, MDPosts: 14Member

    Here is my ApplyRecalibration command line:

    java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}_snp_recal\
        -jar ${gatk}\
        -T ApplyRecalibration\
        -R ${refSequence}\
        -input ${SCRATCH}/${sample}.raw_variants.vcf\
        -mode SNP\
        --ts_filter_level 99.0\
        -recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal\
        -tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches\
        -o ${SCRATCH}/${sample}.recalibrate_snps_raw_indels.vcf
    

    This is my next step after the VariantRecalibrator call above, which is the step that fails. I am trying to follow the steps in http://www.broadinstitute.org/gatk/guide/topic?name=tutorials.

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Mmkay, not seeing much -- can you post the log file for one run? Are you running recalibration per sample? This is not our recommended workflow...

    Geraldine Van der Auwera, PhD

  • noushin6noushin6 Baltimore, MDPosts: 14Member

    Do you mean the log file for one run of VariantRecalibrator?

    I am possibly confused about the recommended workflow at this stage. Can you please point me to the proper section of documentation?

    I am planning to run HaplotypeCaller on my individual samples to generate the initial set of variant calls. Should I do a bam file merge from multiple samples before calling HaplotypeCaller? The experiment I am looking at has very few normal tissue bam files.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    I meant the log files for one run of VariantRecalibrator and the corresponding run of ApplyRecalibration. The point is to make sure that the inputs and outputs are matching up correctly.
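To make the point about matching inputs and outputs concrete, here is a minimal pre-flight check, assuming a bash-like shell. The function name `check_vqsr_outputs` is hypothetical and not part of GATK. It only confirms that the files named with `-recalFile` and `-tranchesFile` exist and are non-empty before ApplyRecalibration is started.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): verify that the files written by
# VariantRecalibrator exist and are non-empty before ApplyRecalibration reads them.
check_vqsr_outputs() {
    local recal="$1" tranches="$2" status=0
    for f in "$recal" "$tranches"; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, with the paths from the command lines above:
# check_vqsr_outputs "${SCRATCH}/${sample}.recalibrate_SNP.recal" \
#                    "${SCRATCH}/${sample}.recalibrate_SNP.tranches" \
#   && echo "inputs present, safe to run ApplyRecalibration"
```

If VariantRecalibrator stopped with "No data found", a check like this would be expected to fail, which is a hint that ApplyRecalibration should not be run yet.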

    You have two possible workflow options for your experiment. One is to call variants on your samples all together, which produces a multisample VCF that you then put through VQSR, as described in the existing Best Practices document here.

    The second option is a brand new workflow which will replace the one I just described (we're still updating the docs). The idea is that instead of calling variants together on all samples, you do it per-sample, but in a special mode that produces GVCFs. Then you run a new joint genotyping step on the GVCFs, which produces a regular multisample VCF that you then put through VQSR. This allows you to bypass the performance issues associated with multisample calling. See here for more details.

    In any case you should not be running VQSR on individual samples, because that will cause your analysis to be underpowered. But keep in mind that unless you use the new workflow (with GVCFs and the additional joint genotyping step), you also can't run VQSR together on samples that were called separately.
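The per-sample GVCF workflow can be sketched roughly as follows. This is a dry run that only prints GATK 3.x command lines rather than executing them. The function name `print_gvcf_workflow` and all file names are hypothetical placeholders, and the exact arguments should be checked against the documentation for the GATK version in use.

```shell
#!/usr/bin/env bash
# Dry run: print the GATK 3.x commands for the per-sample GVCF workflow
# instead of executing them. All file names are hypothetical placeholders.
print_gvcf_workflow() {
    local gatk_jar="$1" ref="$2" joint_vcf="$3"
    shift 3
    local variants="" bam gvcf
    for bam in "$@"; do
        gvcf="${bam%.bam}.g.vcf"
        # Step 1: per-sample calling in reference-confidence (GVCF) mode.
        # Assumption: early 3.x releases also need the two --variant_index_* arguments.
        echo "java -jar $gatk_jar -T HaplotypeCaller -R $ref -I $bam" \
             "--emitRefConfidence GVCF" \
             "--variant_index_type LINEAR --variant_index_parameter 128000" \
             "-o $gvcf"
        variants="$variants --variant $gvcf"
    done
    # Step 2: joint genotyping across all GVCFs yields one multisample VCF.
    echo "java -jar $gatk_jar -T GenotypeGVCFs -R $ref$variants -o $joint_vcf"
    # Step 3 (not printed): run VariantRecalibrator and ApplyRecalibration on $joint_vcf.
}

# Usage:
# print_gvcf_workflow GenomeAnalysisTK.jar ref.fa joint.vcf sampleA.bam sampleB.bam
```

Printing the commands first makes it easy to confirm that every sample's GVCF is passed to the joint genotyping step before any compute time is spent.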

    Let me know if you need any further clarification.

    Geraldine Van der Auwera, PhD

  • michael_recombinemichael_recombine nycPosts: 4Member

    Hi there,

    I am running into a very similar error to tgenahmet's. I am trying to recalibrate some variants that I called from a .bam file produced from two TruSight One paired-end reads. Attached is my command line and the output. Any help would be greatly appreciated. Could the error be due to the fact that I am trying to call these variants from only one .bam file?

    Thank you in advance!

    Attachment: Variant_recalibrator_error.txt (6K)

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @michael_recombine

    Hi,

    Is this a single exome you are running on? It's not recommended to run on only one exome sample (WGS may be ok).

    If you only have one exome sample, you can use data from the 1000 Genomes project to beef up your data set. http://www.1000genomes.org/data

    Or, you can try to force the number of bad variants using minNumBadVariants (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html#--minNumBadVariants). But, adding more data is better.
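Before running VQSR it can help to check how many samples and variant records the input VCF actually holds. The sketch below assumes an uncompressed, tab-delimited VCF. The function names `count_vcf_samples` and `count_vcf_records` are hypothetical helpers, not GATK tools.

```shell
#!/usr/bin/env bash
# Count sample columns in a VCF: the #CHROM header line has 9 fixed
# columns (CHROM through FORMAT) followed by one column per sample.
# A sites-only VCF (8 columns) is reported as 0 samples.
count_vcf_samples() {
    awk -F'\t' '/^#CHROM/ { n = NF - 9; print (n > 0 ? n : 0); exit }' "$1"
}

# Count variant records, i.e. every line that is not a header line.
count_vcf_records() {
    grep -vc '^#' "$1"
}

# Usage:
# echo "samples: $(count_vcf_samples my_calls.vcf)"
# echo "records: $(count_vcf_records my_calls.vcf)"
```

A sample count of 1 on exome data is the situation described above, where adding more data is the better fix.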

    -Sheila

  • michael_recombinemichael_recombine nycPosts: 4Member

    I am only looking at one individual's exome, so that could be the issue. My VCF file contains over 10k variants; what is typically a suitably large number of variants? The main reason your suggestion puzzles me is that in the past, when testing this pipeline, I was getting an error prompting me to use -minNumBad to fix it. Now I am not getting that error and am instead just seeing: ##### ERROR MESSAGE: No data found.

    Any suggestions would be great. I am retrying with -minNumBad

    Thanks again!

  • michael_recombinemichael_recombine nycPosts: 4Member

    Just ran the data with -minNumBad set to 5000 and got the same error. Attached is my command line and output. Would you suggest just merging in some more samples to increase my amount of data? Thanks!

    Attachment: VQSR_error_2.txt (8K)

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @michael_recombine

    Hi,

    Yes, the best thing to do is to use data from the 1000 Genomes project. Please find the data here: http://www.1000genomes.org/data

    -Sheila

  • michael_recombinemichael_recombine nycPosts: 4Member

    Hi @Sheila,

    Thanks for your help up to now. I merged my VCF with the 1000 Genomes VCF and then ran VQSR for SNPs, and it was successful (FINALLY)! But when I tried to run the output through VQSR again for indels, I received the same error I got before. This time around it found very few indels, obviously. Is this because I chose 1000G as my additional data?

    Thanks again!

    Attachment: Indel_ERROR.txt (5K)

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @michael_recombine

    Hi,

    Two things:

    1) You should not simply merge your VCFs with the 1000G VCFs. You should get the 1000G BAMs, run the calling pipeline to generate GVCFs, do joint genotyping on all GVCFs together, and only then run VQSR. I realize this was not apparent in my original post, so I will be preparing a new article explaining this more clearly.

    2) Problems with indels are common because indels are so much rarer than SNPs. This is not caused by choosing 1000G. I do not know how much data you used, but you might need to use more data from 1000G. Our recommendation is to use 30 or more BAMs.
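To see how few indels a call set contains compared with SNPs, a rough count can be made directly from the REF and ALT columns. This is a simplified sketch: the function name `count_snps_indels` is hypothetical, and it assumes an uncompressed VCF without symbolic alleles (such as `<NON_REF>`), which it would miscount as indels.

```shell
#!/usr/bin/env bash
# Rough classification of VCF records by allele length:
#   INDEL if any ALT allele differs in length from REF,
#   SNP   if REF and every ALT are a single base,
#   OTHER for equal-length multi-base substitutions.
count_snps_indels() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($5, alt, ",")
            is_indel = 0
            for (i = 1; i <= n; i++)
                if (length(alt[i]) != length($4)) is_indel = 1
            if (is_indel) indels++
            else if (length($4) == 1) snps++
            else other++
        }
        END { printf "SNP=%d INDEL=%d OTHER=%d\n", snps, indels, other }
    ' "$1"
}

# Usage:
# count_snps_indels joint_calls.vcf
```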

    Good luck!

    -Sheila

  • kamilo889kamilo889 kamilo889Posts: 6Member

    Hi all, I am trying to run VQSR but get the "No data found" error in VariantRecalibrator. From the comments above, this seems to be because I am running an exome analysis on a single sample, so there is not enough data to run the program, right? But the part about "merging the file with the 1000G" isn't clear to me. Can you help me a little more, please?

    Regards Camilo

  • SheilaSheila Broad InstitutePosts: 540Member, GATK Developer, Broadie, Moderator admin

    @kamilo889

    Hi Camilo,

    You are correct that 1 exome is not enough data for VQSR. Instead of VQSR, you can try hard filtering. Please read about hard filtering here: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
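As a rough illustration of what hard filtering SNPs looks like on the command line, the sketch below prints (without running) a SelectVariants step followed by a VariantFiltration step in GATK 3.x syntax. The function name `print_hard_filter_snps`, the filter name, the file names, and the thresholds in `SNP_FILTER_EXPR` are illustrative assumptions only. Take the recommended expressions from the linked article.

```shell
#!/usr/bin/env bash
# Illustrative thresholds only; use the values recommended in the linked article.
SNP_FILTER_EXPR='QD < 2.0 || FS > 60.0 || MQ < 40.0'

# Dry run: print the two commands instead of executing them.
print_hard_filter_snps() {
    local gatk_jar="$1" ref="$2" raw_vcf="$3"
    local snps="${raw_vcf%.vcf}.snps.vcf"
    local filtered="${raw_vcf%.vcf}.snps.filtered.vcf"
    # Step 1: extract the SNPs from the raw call set.
    echo "java -jar $gatk_jar -T SelectVariants -R $ref -V $raw_vcf -selectType SNP -o $snps"
    # Step 2: tag records matching the expression in the FILTER column.
    echo "java -jar $gatk_jar -T VariantFiltration -R $ref -V $snps" \
         "--filterExpression \"$SNP_FILTER_EXPR\" --filterName hard_snp_filter -o $filtered"
}

# Usage:
# print_hard_filter_snps GenomeAnalysisTK.jar ref.fa raw_variants.vcf
```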

    If you want to use VQSR, you will first need to get more data from the 1000Genomes data webpage here: http://www.1000genomes.org/data

    You should get at least 30 bam files from samples that are genetically similar to your sample exome and run all of the steps involved in doing a joint analysis. Please read about how to do a joint analysis here: http://www.broadinstitute.org/gatk/guide/article?id=3893

    I hope this helps.

    -Sheila

  • sakornilovsakornilov Yale UniversityPosts: 4Member

    Hi, I'm actually running into the same issue with SNP VQSR run on a multisample VCF file produced by GenotypeGVCFs. This is a set of 21 exomes (I know about the k~30 recommendation, but 1000 Genomes does not have ethnically similar samples, and I'm mostly doing this to pilot the workflow).

    Here's the code

    java -jar /PATH/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /PATH/hg19index2.fa \
    -input XXX.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /PATH/GATK_Resources/hapmap_3.3.hg19.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /PATH/GATK_Resources/1000G_omni2.5.hg19.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /PATH/GATK_Resources/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /PATH/GATK_Resources/dbsnp_138.hg19.vcf \
    -L /PATH/SeqCap_EZ_Exome_v2.bed \
    -an QD \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -titv 3 \
    -mode SNP \
    -tranche 100.0 -tranche 99.5 -tranche 99.0 -tranche 90.0 \
    -recalFile XXX_recalSNP.recal \
    -tranchesFile XXX.tranches \
    -rscriptFile XXX_recalSNP_plots.R
    

    Which produces

    INFO 00:12:38,603 GenomeAnalysisEngine - Strictness is SILENT
    INFO 00:12:38,666 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 00:12:39,111 GenomeAnalysisEngine - Preparing for traversal
    INFO 00:12:39,119 GenomeAnalysisEngine - Done preparing for traversal
    INFO 00:12:39,120 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 00:12:39,120 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 00:12:39,120 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 00:12:39,125 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
    INFO 00:12:39,126 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
    INFO 00:12:39,126 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
    INFO 00:12:39,126 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
    INFO 00:13:09,124 ProgressMeter - chr1:168282042 6525865.0 30.0 s 4.0 s 5.4% 9.3 m 8.8 m
    INFO 00:13:39,126 ProgressMeter - chr2:102475430 1.3439691E7 60.0 s 4.0 s 11.2% 8.9 m 7.9 m
    INFO 00:14:09,130 ProgressMeter - chr3:45857514 2.032475E7 90.0 s 4.0 s 17.2% 8.7 m 7.2 m
    INFO 00:14:39,133 ProgressMeter - chr4:32353684 2.7060788E7 120.0 s 4.0 s 23.0% 8.7 m 6.7 m
    INFO 00:15:09,135 ProgressMeter - chr5:37997884 3.3309406E7 2.5 m 4.0 s 29.3% 8.5 m 6.0 m
    INFO 00:15:39,138 ProgressMeter - chr6:34847769 3.972357E7 3.0 m 4.0 s 35.0% 8.6 m 5.6 m
    INFO 00:16:09,141 ProgressMeter - chr7:43252920 4.6076952E7 3.5 m 4.0 s 40.7% 8.6 m 5.1 m
    INFO 00:16:39,144 ProgressMeter - chr8:59083943 5.2572327E7 4.0 m 4.0 s 46.3% 8.6 m 4.6 m
    INFO 00:17:09,147 ProgressMeter - chr9:125329947 5.9087344E7 4.5 m 4.0 s 53.1% 8.5 m 4.0 m
    INFO 00:17:39,148 ProgressMeter - chr11:20077519 6.6175312E7 5.0 m 4.0 s 58.5% 8.5 m 3.5 m
    INFO 00:18:09,151 ProgressMeter - chr12:55433666 7.3219963E7 5.5 m 4.0 s 64.0% 8.6 m 3.1 m
    INFO 00:18:39,154 ProgressMeter - chr14:24684974 7.9821067E7 6.0 m 4.0 s 70.9% 8.5 m 2.5 m
    INFO 00:19:09,159 ProgressMeter - chr16:4650284 8.6967833E7 6.5 m 4.0 s 77.0% 8.4 m 116.0 s
    INFO 00:19:39,161 ProgressMeter - chr17:77877251 9.4673949E7 7.0 m 4.0 s 82.2% 8.5 m 91.0 s
    INFO 00:20:09,164 ProgressMeter - chr20:14506766 1.02436026E8 7.5 m 4.0 s 87.1% 8.6 m 66.0 s
    INFO 00:20:39,165 ProgressMeter - chrX:70830524 1.0971454E8 8.0 m 4.0 s 94.1% 8.5 m 30.0 s
    INFO 00:20:49,461 VariantDataManager - QD: mean = 17.71 standard deviation = 6.41
    INFO 00:20:49,478 VariantDataManager - MQ: mean = 69.82 standard deviation = 1.66
    INFO 00:20:49,485 VariantDataManager - MQRankSum: mean = 0.11 standard deviation = 0.69
    INFO 00:20:49,494 VariantDataManager - ReadPosRankSum: mean = 0.41 standard deviation = 0.73
    INFO 00:20:49,746 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, MQRankSum, ReadPosRankSum]
    INFO 00:20:49,756 VariantDataManager - Training with 74924 variants after standard deviation thresholding.
    INFO 00:20:49,759 GaussianMixtureModel - Initializing model with 100 k-means iterations...
    INFO 00:20:52,290 VariantRecalibratorEngine - Finished iteration 0.
    INFO 00:20:53,325 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 2.03635
    INFO 00:20:54,356 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 4.81600
    INFO 00:20:55,445 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.23736
    INFO 00:20:56,538 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.23086
    INFO 00:20:57,657 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.16611
    INFO 00:20:58,814 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.02066
    INFO 00:20:59,969 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00940
    INFO 00:21:01,138 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00720
    INFO 00:21:02,328 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00735
    INFO 00:21:03,512 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.00786
    INFO 00:21:04,665 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.01016
    INFO 00:21:05,827 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.01475
    INFO 00:21:06,995 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.01589
    INFO 00:21:08,166 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.02197
    INFO 00:21:09,168 ProgressMeter - chrUn_gl000228:32096 1.1237226E8 8.5 m 4.0 s 100.0% 8.5 m 0.0 s
    INFO 00:21:09,354 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.03203
    INFO 00:21:10,537 VariantRecalibratorEngine - Finished iteration 80. Current change in mixture coefficients = 0.04255
    INFO 00:21:11,729 VariantRecalibratorEngine - Finished iteration 85. Current change in mixture coefficients = 0.01019
    INFO 00:21:12,933 VariantRecalibratorEngine - Finished iteration 90. Current change in mixture coefficients = 0.00551
    INFO 00:21:14,137 VariantRecalibratorEngine - Finished iteration 95. Current change in mixture coefficients = 0.01607
    INFO 00:21:15,338 VariantRecalibratorEngine - Finished iteration 100. Current change in mixture coefficients = 0.01040
    INFO 00:21:16,546 VariantRecalibratorEngine - Finished iteration 105. Current change in mixture coefficients = 0.14741
    INFO 00:21:17,770 VariantRecalibratorEngine - Finished iteration 110. Current change in mixture coefficients = 0.18376
    INFO 00:21:18,524 VariantRecalibratorEngine - Convergence after 113 iterations!
    INFO 00:21:18,651 VariantRecalibratorEngine - Evaluating full set of 98477 variants...
    INFO 00:21:18,661 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
    INFO 00:21:19,564 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
        at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: No data found.
    ERROR ------------------------------------------------------------------------------------------

    I would really appreciate any and all help on this...

    Thanks!

    Best,

    Sergey

  • sakornilovsakornilov Yale UniversityPosts: 4Member

    I forgot to add that the vcf file was produced with the -allSites flag.

  • KurtKurt Posts: 161Member ✭✭✭

    @sakornilov,

    Try taking out:

    -an MQ

    from your command line.

  • sakornilovsakornilov Yale UniversityPosts: 4Member

    Hm..that actually worked. Thanks, Kurt! (but.. why did it not work with -an MQ?)

    @Kurt said: sakornilov,

    Try taking out:

    -an MQ

    from your command line.

  • KurtKurt Posts: 161Member ✭✭✭

    @sakornilov

    There are a couple of threads/posts/whatever on here somewhat recently about it crashing/destabilizing the model. The gist of it is that most of the MQ for the data (so far in my experience it has been on any targeted capture projects, exome or otherwise) is heavily skewed at 60 (presumably when using BWA) and HC filters anything where MQ<20. So in effect, MQ ends up not being informative in the model for these types of projects, at least in the new-ish HC GVCF joint calling workflow (it did work in older workflows). It does still work for whole genomes in my experience however.
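One way to check whether MQ is concentrated at a single value in a given call set is to tabulate the MQ annotation from the INFO column. This is a minimal sketch assuming an uncompressed VCF that carries `MQ=` in INFO. The function name `mq_histogram` is hypothetical, and values are rounded to the nearest integer for binning.

```shell
#!/usr/bin/env bash
# Print a two-column table (rounded MQ value, number of records), sorted by MQ.
# Only the exact key "MQ=" is matched, so MQ0= and MQRankSum= are ignored.
mq_histogram() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^MQ=/) {
                    v = substr(kv[i], 4)
                    count[int(v + 0.5)]++
                }
        }
        END { for (b in count) print b "\t" count[b] }
    ' "$1" | sort -n
}

# Usage:
# mq_histogram joint_calls.vcf
```

If nearly all records fall into one bin, the annotation carries little information for the model, which is consistent with the behaviour described above.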

  • sakornilovsakornilov Yale UniversityPosts: 4Member

    @Kurt

    Your explanation is much appreciated (as is GATK Developer Team's hard work).

  • KurtKurt Posts: 161Member ✭✭✭

    @sakornilov

    Oh, I'm not a GATK developer. Just a VERY appreciative end-user.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    And a VERY helpful one too, Kurt :)

    Geraldine Van der Auwera, PhD
