Variant Recalibration failing

allyhumeallyhume United KingdomMember

I'm trying to run variant recalibration for SNP as part of the GATK best practice workflow for DNA sequence analysis but am getting the following error presumably because after training there are 0 variants with an LOD score <= -5.0.

The error is:

ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Caused by: java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
... 5 more

I am using GATK 3.2-2. The full output from this stage is attached.

I also got a similar error with GATK 3.1. I have run it a few times now and although there are very slight variations in the numbers it outputs the result is consistently 0 variants with a LOD <= -5.0.

I have run this stage successfully when I process my reads in two separate chunks but when a combine them all into one processing pipeline they fail at this stage.

Any advice would be welcome on to analyse this and determine what I need to change or where I am going wrong.

Regards,

Ally

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @allyhume‌

    Hi Ally,

    This error typically happens when there are not enough variants, possibly because there are not enough samples being run together.

    What exactly do you mean when you say

    I have run this stage successfully when I process my reads in two separate chunks but when a combine them all into one processing pipeline they fail at this stage.

    Can you clarify the workflow? Thanks.

    -Sheila

  • allyhumeallyhume United KingdomMember

    Hi Shelia,

    Yes, I've seen a similar error before when there is little data but in this case I think there is plenty of data. The program reports a warning to see it has too much and has downsampled (see the attachment for full output):

    WARNING: Very large training set detected. Downsampling to 2500000 training variants.

    To explain what I meant before. I have four pairs of fastq files (pair ended) that together give about 30x coverage. If I process each pair totally separately then the full best practice pipeline works fine. If I combine two of them (at the mark duplicates stage) then again all works well. But when I combine all four of them to get the best coverage the the pipeline fails at the SNP variant recalibration stage.

    Ally

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Ally,

    Sorry for the delayed response, this is a bit of a stumper. We don't normally see issues with too much data. You can try raising the -maxNumTrainingData argument value, but we don't think that is the problem. It could be a problem with the data. Could you please post the recalibration plots you get from the separate runs?

  • allyhumeallyhume United KingdomMember

    Hi Geraldine,

    Sorry for not getting back to you sooner. Vacation and system maintenance got in the way.

    I've made various data files available to you that will hopefully allow you to
    investigate further way causes this. Please find these files at:
    http://www.epcc.ed.ac.uk/~ally/gatk/

    Firstly the .vcf (in gzip format) and corresponding .vcf.idx file that produce the error
    are available:
    ERR091787_ERR091788_ERR091789_ERR091790.genotyping.vcf.gz
    ERR091787_ERR091788_ERR091789_ERR091790.genotyping.vcf.idx (12M)

    As I said before, when I used half the available data the pipeline works fine with both
    halves. As requested, the PDFs produced by the SNP variant recalibration stage for each
    of these halves are in the following files:
    ERR091788_ERR091789.genotyping.variant_recalib_snp_1.R.pdf
    ERR091788_ERR091789.genotyping.variant_recalib_snp_1.tranches.pdf
    ERR091787_ERR091790.genotyping.variant_recalib_snp_1.R.pdf
    ERR091787_ERR091790.genotyping.variant_recalib_snp_1.tranches.pdf

    In case they are helpful the .vcf files that were successfully processed are also
    available:
    ERR091787_ERR091790.genotyping.vcf.gz
    ERR091788_ERR091789.genotyping.vcf.gz

    The command line used to produce the error is:

        java -Xmx500g -jar GenomeAnalysisTK.jar -T VariantRecalibrator  
        -R human_g1k_v37.fasta  
        -input ERR091787_ERR091788_ERR091789_ERR091790.genotyping.vcf  
        -recalFile ERR091787_ERR091788_ERR091789_ERR091790.genotyping.variant_recalib_snp_1.recal  
        -tranchesFile ERR091787_ERR091788_ERR091789_ERR091790.genotyping.variant_recalib_snp_1.tranches  
        -rscriptFile ERR091787_ERR091788_ERR091789_ERR091790.genotyping.variant_recalib_snp_1.R  
        -nt 16 
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 gatk-bundle/hapmap_3.3.b37.vcf  
        -resource:omni,known=false,training=true,truth=true,prior=12.0   gatk-bundle/1000G_omni2.5.b37.vcf  
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 gatk-bundle/1000G_phase1.snps.high_confidence.b37.vcf  
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0  gatk-bundle/dbsnp_138.b37.vcf  
        -an QD  -an MQ  -an MQRankSum  -an ReadPosRankSum  -an FS  -an DP  -mode SNP
    

    Hopefully this will help you to understand what has happened. Thank you for helping me with this.

    Regards,

    Ally

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @allyhume‌

    Hi Ally,

    Can you please upload the files to our ftp server? It makes debugging easier for us. Instructions on how to do so are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Thanks,
    Sheila

  • allyhumeallyhume United KingdomMember

    Sheila,

    I've uploaded all the files mentioned above to the ftp server in file 15122.zip

    Thanks,

    Ally

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks Ally, we've replicated your bug and filed a support ticket for this. It's in the hands of the devs now; we'll let you know when we have a solution.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @allyhume‌

    Hi Ally,

    I just heard back from the developer, and he thinks this is a problem with the data, specifically MQ.

    The MQ distribution is off -- the vast majority of the variants have MQ=60.00, and some actually go above that. Are you mixing read groups aligned with different aligners? The tight band at MQ60 seems to be screwing up the model.

    If you remove MQ from the training, the results look good. MQ just does not seem to be informative for your data.

    I hope this helps.

    -Sheila

  • KurtKurt Member ✭✭✭

    @Sheila‌ Just FYI, I commented on this, but the post is awaiting approval (either b/c I put a link in there or b/c I attached a file...don't know which)...in short, I see the same thing with adding MQ to VQSR for SNPs in exomes when using bwa mem (and also have a vary narrow distribution at MQ60).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Kurt, I verified your account so this won't happen again, and I'll rescue your comment from moderation limbo now.

  • KurtKurt Member ✭✭✭

    Huh, I had/have a similar issue with processing exomes through VQSR and it looks like this is same reason. I use bwa-mem 0.7.8 with default settings (link to my previous post below). Good to know the reason now :)

    http://gatkforums.broadinstitute.org/discussion/4273

    attached is my distribution (X axis is MQ rounded to the nearest integer, Y axis is count of SNV sites).

  • KurtKurt Member ✭✭✭

    Thanks @Geraldine_VdAuwera‌ , I reattached my distribution plot since it looks like that didn't come through.

  • byb121byb121 UKMember
    edited October 2014

    Hi @Kurt‌ , same problem here, so what's the solution? I mean if there is still a way of using MQ for the recalibration. Tweak BWA? Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @byb121‌ It looks like this effect is popping up a lot (partly due to HC's specific filtering, and to improvements in sequencing quality/mapping), and the conclusion is simply that when it happens, it renders MQ uninformative for VQSR, so MQ should be dropped. It's not a cause for particular concern. We'll see if we can tweak VQSR to handle this more smartly than by flipping out, and/or still derive some information from MQ, in a future version.

  • byb121byb121 UKMember

    @Geraldine_VdAuwera‌ Thank you very much for the reply. The thing is when recalibrating with fewer variants, there is no error or warning, but in fact the recalibration looks messed by MQ. Should the user be warned when a NOT informative annotation is used?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @byb121‌ It's difficult to do this dynamically (because it requires adding some advanced logic for the program to recognize that an annotation is uninformative in a potentially detrimental way) but we'll see what we can do. In the meantime we'll try to document the potential issues with MQ better.

  • mmokrejsmmokrejs Czech RepublicMember
    edited May 2017

    @Geraldine_VdAuwera
    Is this documented already somewhere? I don't see in VQSR documentation that I should drop for my exome analysis not only DP but also MQ from the command line if recal charts seem weird. I use The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 now.

    How does this relate to MQ being 75 or 70 after BQSR? Elsewhere you discuss one needs to change some stuff (forgot what exactly and where) to cope with MAPQ>60 after the BQSR step.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mmokrejs
    Hi,

    The [MQ Jitter function] (which I think you used) deals with the issue in this thread. Have a look here for more information.

    -Sheila

  • mmokrejsmmokrejs Czech RepublicMember
    edited May 2017

    Right, I had also tried VariantRecalibrator -an MQ -MQCap 70 on the GVCF files derived by HaplotypeCaller from BAM files pre-processed with IndelRealigner + BQSR. I was inspired by https://software.broadinstitute.org/gatk/blog?id=7847 . However, addition of the two arguments had no effect IMHO. I thought maybe the MAPQ0 to MAPQ19 changes are the culprit as I do not actually understand what the article says about these (notably the first two tables shown in the article are cryptic to me).

    Therefore I even went to use less derived BAM files (only processed to remove duplicates and being realigned in sync with the tumor sample from the same individual). Again, the -an MQ -MQCap 70 had no effect.

    I did not use --MQCapForLogitJitterTransform on the commandline but if I understand the thread correctly it is was anyway used.

    I did not user either HaplotypeCaller --read-filter ReassignOriginalMQAfterIndelRealignment with followup VariantRecalibrator -MQCap 60 and I probably won't because I would have yet another BAM file on my disks. BTW, I recalculated the normal+tumor sample pairs using: GenomeAnalysisTK.jar -T IndelRealigner -R "$reference_flatfile" -I "$peripheral_blood_bam" -I "$bone_marrow_bam" $known_indel_calls -targetIntervals "$sample_prefix"."$aligner""$postalt"removed_duplicates.forIndelRealigner.intervals --consensusDeterminationModel KNOWNS_ONLY --bam_compression 9 -LOD 0.4 -nWayOut '.realignedtogether.bam'

    I would say in general, GATK should carefully inspect BAM/VCF header and try to ensure user fed the program with meaningful input data. It is terribly inefficient if users even have to change default argument values just to follow up the Best practices way. That should be the default, automagically.

    I do not see a reason why GATK could not decide on its own if -MQCap 70 or -MQCap 60 should be applied during VariantRecalibrator step and consult for that commandline from the input VCF file stating whether --MQCapForLogitJitterTransform was already applied or not. In summary, I propose a lot more safety checks in the code.

    I will probably re-create the per-sample GVCF files from the original BAM files as output from bwa in ALT-aware mode and see if that makes magically a difference.

  • mmokrejsmmokrejs Czech RepublicMember

    I will probably re-create the per-sample GVCF files from the original BAM files as output from bwa in ALT-aware mode and see if that makes magically a difference.

    So using BAM fiels without IndelRealigner processing does not help either. Below I show their difference to those after IndelRealigner processing. Seems barely a difference although I do not have a complete picture of the changes.

    - mysample.bwa.postalt.removed_duplicates.HaplotypeCaller.g.vcf
    + mysample.bwa.postalt.removed_duplicates.realignedtogether.HaplotypeCaller.g.vcf
    @@ -7192,7 +7192,7 @@
     chr1   1013480 .       T       <NON_REF>       .       .       END=1013480     GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
     chr1   1013481 .       G       <NON_REF>       .       .       END=1013486     GT:DP:GQ:MIN_DP:PL      0/0:18:51:18:0,51,765
     chr1   1013487 .       T       <NON_REF>       .       .       END=1013489     GT:DP:GQ:MIN_DP:PL      0/0:19:54:19:0,54,810
    -chr1   1013490 rs4615788       C       G,<NON_REF>     806.03  .       DB;DP=20;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=72000.00     GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,20,0:20:60:0|1:1013466_T_TA:820,60,0,820,60,820:0,0,20,0
    +chr1   1013490 rs4615788       C       G,<NON_REF>     806.03  .       DB;DP=20;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=73300.00     GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,20,0:20:60:0|1:1013466_T_TA:820,60,0,820,60,820:0,0,20,0
     chr1   1013491 .       G       <NON_REF>       .       .       END=1013512     GT:DP:GQ:MIN_DP:PL      0/0:23:60:20:0,60,720
     chr1   1013513 .       C       <NON_REF>       .       .       END=1013513     GT:DP:GQ:MIN_DP:PL      0/0:27:72:27:0,72,1080
     chr1   1013514 .       G       <NON_REF>       .       .       END=1013515     GT:DP:GQ:MIN_DP:PL      0/0:26:69:25:0,69,1035
    @@ -64237,11 +64230,11 @@
     chr1   6549492 .       A       <NON_REF>       .       .       END=6549493     GT:DP:GQ:MIN_DP:PL      0/0:19:24:19:0,24,360
     chr1   6549494 .       A       <NON_REF>       .       .       END=6549497     GT:DP:GQ:MIN_DP:PL      0/0:21:30:20:0,30,450
     chr1   6549498 .       T       <NON_REF>       .       .       END=6549498     GT:DP:GQ:MIN_DP:PL      0/0:20:27:20:0,27,405
    -chr1   6549499 rs11122091      C       A,<NON_REF>     289.60  .       BaseQRankSum=-2.273;ClippingRankSum=0.000;DB;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=75600.00;ReadPosRankSum=0.282     GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:11,10,0:21:99:0|1:6549499_C_A:297,0,432,331,462,793:9,2,9,1
    +chr1   6549499 rs11122091      C       A,<NON_REF>     289.60  .       BaseQRankSum=-2.273;ClippingRankSum=0.000;DB;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.858;RAW_MQ=76900.00;ReadPosRankSum=0.282    GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:11,10,0:21:99:0|1:6549499_C_A:297,0,432,331,462,793:9,2,9,1
     chr1   6549500 .       C       <NON_REF>       .       .       END=6549500     GT:DP:GQ:MIN_DP:PL      0/0:21:26:21:0,26,711
     chr1   6549501 .       T       <NON_REF>       .       .       END=6549502     GT:DP:GQ:MIN_DP:PL      0/0:22:29:22:0,29,690
    -chr1   6549503 rs34630031      CAT     C,<NON_REF>     408.60  .       BaseQRankSum=-1.257;ClippingRankSum=0.000;DB;DP=24;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=86400.00;ReadPosRankSum=0.164     GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:13,11,0:24:99:1|0:6549499_C_A:416,0,484,455,517,972:11,2,9,2
    -chr1   6549506 .       A       <NON_REF>       .       .       END=6549506     GT:DP:GQ:MIN_DP:PL      0/0:23:0:23:0,0,166
    +chr1   6549503 rs34630031      CAT     C,<NON_REF>     408.60  .       BaseQRankSum=-1.257;ClippingRankSum=0.000;DB;DP=24;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=1.171;RAW_MQ=87700.00;ReadPosRankSum=0.164     GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:13,11,0:24:99:1|0:6549499_C_A:416,0,484,455,517,972:11,2,9,2
    +chr1   6549506 .       A       <NON_REF>       .       .       END=6549506     GT:DP:GQ:MIN_DP:PL      0/0:24:0:24:0,0,127
     chr1   6549507 .       G       <NON_REF>       .       .       END=6549510     GT:DP:GQ:MIN_DP:PL      0/0:23:57:22:0,57,855
     chr1   6549511 .       T       <NON_REF>       .       .       END=6549520     GT:DP:GQ:MIN_DP:PL      0/0:22:63:21:0,63,763
     chr1   6549521 .       C       <NON_REF>       .       .       END=6549528     GT:DP:GQ:MIN_DP:PL      0/0:27:72:25:0,72,1080
    
  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited May 2017

    Hi @mmokrejs,

    Therefore I even went to use less derived BAM files (only processed to remove duplicates and being realigned in sync with the tumor sample from the same individual).

    Are you analyzing tumor samples? In this case, to call somatic mutations against a matched normal, our Best Practices recommend you call using MuTect. Currently, MuTect1 is what is available for data analysis as MuTect2 is still in beta status. MuTect1 calls only somatic SNVs (single nucleotide variants) and does not call indels. Because of this, for workflows that use MuTect1, we still recommend indel realignment.

    Regarding https://software.broadinstitute.org/gatk/blog?id=7847:

    notably the first two tables shown in the article are cryptic to me

    The article is geared towards users who do germline variant calling with sequencing data from newer sequencing technology. Newer sequencing techs allow for longer reads and higher quality sequences. The article gets into the minutia of removing indel realignment for this type of data. Depending on the distribution of types of reads, e.g. duplicate reads (table 1) or low mapping quality reads (MAPQ; table 2), you may consider retaining indel realignment. My apologies if this was not clear.

  • mmokrejsmmokrejs Czech RepublicMember
    edited May 2017

    Hi @Shlee,
    for somatic mutation analysis I do use MuTect2. However, I wanted to used the data also for other analyses so for the joint genotyping I only used just the germinal tissue exome samples and I will use them later along other genome or exome samples. at the moment I just analyzed all the germinal exome tissues (and got the bad VQSR outputs).

    Could the inability to get meaningful VQSR models (per PDF plots) be possibly caused by badly demultiplexed samples? I never had in my hand original FASTQ files but I tend to speculate now the reads of those 35 datasets were not demultiplexed properly and reads from one sample appear as a true background contamination in another sample. Even worse, as there were the cancer samples from the same subjects, the background could be biased also thank to the badly demultiplexed cancer samples. Provided an SNP is authentic in one sample it could appear as a true background noise in another sample, and vice versa.

    Would that explain why VQSR does not work in my hands? Do you have an idea how to check for this? Other SNP calling approaches obviously do not break completely, so VQSR is somewhat exceptional in this regard due to its sensitivity. I even tried HaplotypeCaller --genotyping_mode DISCOVERY --useNewAFCalculator --emitRefConfidence GVCF --pcr_indel_model HOSTILE but it did not help in terms of VQSR models. Hence why I am thinking of wrong demultiplexing now.

    Pleake make a proper legend for tables 1 and 2. I would like to read in verbatim what am I supposed to see in them. Finding in the main text what relates to them is too difficult. Isn't MQ sometimes peaking at 75? Wouldn't VariantRecalibrator -MQCap=75 be more approariate? Does it hurt if I set it higher than is the actual max() value?

    Post edited by mmokrejs on
  • mmokrejsmmokrejs Czech RepublicMember
    edited May 2017

    I went to check my MAX MQ values. First of all, in one case there is 'nan' value (I guess GATK GenotypeGVCFs should have raised an error either when poking over --variant sample.g.vcf input record or when calculating wrong values internally across all smaples) and anyway, my MAX is much larger that 70.

    $ bcftools query -f '%INFO/MQ\n' mysamples.raw.vcf | sort | uniq -c | sort -k2n,2 | less
          1 0
          2 nan
          3 3.7
          1 3.84
          1 4.22
          2 4.5
    ...
          1 447.39
          1 480.94
          1 490.46
          1 535.47
          1 967.73
    $
    
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @mmokerjs,

    I think you would benefit from reading the papers we list on https://software.broadinstitute.org/gatk/documentation/article.php?id=6201. The last paper has a section titled "BASIC PROTOCOL 3" that describes how VQSR works. Also, check out our official Best Practice Workflows documentation at https://software.broadinstitute.org/gatk/best-practices/. Better yet, if you could attend one of our workshops, I think you would benefit immensely. I see you are posting from the Czech Republic. We have locations and dates for four upcoming GATK workshops in Europe:

    • Cambridge, UK: 7/12-7/14 (Wednesday to Friday)
    • Edinburgh, UK: 7/17-7/19 (Monday to Wednesday)
    • Helsinki, Finland: 9/13-9/15 (Wednesday to Friday)
    • Oxford, UK: 9/18-9/20 (Monday to Wednesday)

    Let's ask @Sheila to comment on the NaN values, as I am unfamiliar with them. Here are my responses to some of the other questions you raise. Remember, the GATK Communications and Support team has limited time and resources and we cannot answer all of the questions we receive comprehensively. We do our best.

    Let me reiterate that MuTect1 is what is available for data analysis as MuTect2 is still in beta status. Beta status means you can play around with the tool but that we do NOT officially sanction it for analysis of data for publication or medical research. The Broad Cancer Program currently uses MuTect1 to call SNVs and MuTect2 to call indels. This workflow that they use, well, you can find it outlined as a WDL script in FireCloud.

    If you suspect bad demultiplexing, then I suggest you talk to the sequencing center that did your sequencing and demultiplexing about this possibility. They will have kept careful records.

    So your question about VQSR relates to germline calling of the matched normals. Remember that VQSR needs to model what is a good versus a bad variant call. What would happen if your germline callset did not have bad calls?

    Would that explain why VQSR does not work in my hands? Do you have an idea how to check for this? Other SNP calling approaches obviously do not break completely, so VQSR is somewhat exceptional in this regard due to its sensitivity.

    Remember that VQSR is for variant filtration. It does not call variants itself. Thanks for the complement on VQSR's exceptional sensitivity.

    The mapping quality I speak of in the indel realignment blogpost is the SAM record MAPQ that measures genome-wide mappability. This is different from the VCF record MQ annotation, albeit related (e.g. see here). Finally, my post on indel realignment is a blogpost, not a white paper nor peer-reviewed journal paper and so the rules of academic publishing, well, we may or may not choose to follow these. Thanks for the suggestion to include legends in our blogpost figures. I agree these would be useful and I'll keep your suggestion in mind going forward.

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited May 2017

    @mmokerjs,

    After discussing with @Sheila your other recent questions on our forum, and your comment:

    I will probably re-create the per-sample GVCF files from the original BAM files as output from bwa in ALT-aware mode and see if that makes magically a difference.

    It appears that you are aligning to GRCh38 without ALT-awareness. I think this is problematic. Please refer to Blog#8180 and Tutorial#8017 on the implications of mapping to a reference with alternate contigs.

    Let us know if aligning in an ALT-aware manner solves your blank VQSR problem.

  • mmokrejsmmokrejs Czech RepublicMember
    edited May 2017

    @Shlee,

    No, I do align in proper ALT-aware mode.

    $ samtools view -H mysample-PB/realignedBAM/mysample-PB.bwa.postalt.removed_duplicates.realignedtogether.BQSR.namesorted.fixmate.calmd.sorted.bam | grep HLA-DRB1 | tail
    @SQ     SN:HLA-DRB1*14:05:01    LN:13933        AH:*
    @SQ     SN:HLA-DRB1*14:54:01    LN:13936        AH:*
    @SQ     SN:HLA-DRB1*15:01:01:01 LN:11080        AH:*
    @SQ     SN:HLA-DRB1*15:01:01:02 LN:11571        AH:*
    @SQ     SN:HLA-DRB1*15:01:01:03 LN:11056        AH:*
    @SQ     SN:HLA-DRB1*15:01:01:04 LN:11056        AH:*
    @SQ     SN:HLA-DRB1*15:02:01    LN:10313        AH:*
    @SQ     SN:HLA-DRB1*15:03:01:01 LN:11567        AH:*
    @SQ     SN:HLA-DRB1*15:03:01:02 LN:11569        AH:*
    @SQ     SN:HLA-DRB1*16:02:01    LN:11005        AH:*
    $
    
  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited May 2017

    Okay, that's good. Since I'm new to your questions, tell me, based on the information you've gathered so far, what do you think is going on?

Sign In or Register to comment.