Should I provide the exome target list (-L argument) even while calling a gVCF file using HaplotypeCaller?

Nanda (Canada; Member)
edited April 2017 in Ask the GATK team

Hi,

Recently we performed exome sequencing on the Illumina Nextera platform for three samples (father, mother, and son). I downloaded the exome interval list from Illumina's website.

1) Trimmed the raw reads.
2) Aligned the trimmed reads against the human reference hg19, as recommended for exome sequencing.
3) Sorted, deduplicated, and recalibrated the BAM files.
4) Performed variant calling as a two-step process:
4.1) Used the GATK HaplotypeCaller tool in GVCF mode on each of the three samples individually
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /GATK_bundle/hg19.fa -I sample1.sorted.dedup.recal.bam --emitRefConfidence GVCF --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.raw.g.vcf
4.2) Used GenotypeGVCFs (joint genotyping) on all three samples together
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /GATK_bundle/hg19.fa --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.2.3.trio.raw.vcf
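For reference, supplying the capture targets to such a call would only mean adding the -L argument (a sketch, not a command from this thread; `nexterarapidcapture_targets.bed` is a placeholder for whatever interval file Illumina provides, and the optional `-ip 100`, i.e. --interval_padding, pads each target by 100 bp):

```
java -Xmx16g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /GATK_bundle/hg19.fa \
    -I sample1.sorted.dedup.recal.bam \
    --emitRefConfidence GVCF \
    --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf \
    -L nexterarapidcapture_targets.bed \
    -ip 100 \
    -o sample1.raw.g.vcf
```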

In the above commands, I didn't use Illumina's exome interval list that was used to target the exomes during the sequencing process.

As per this link "https://software.broadinstitute.org/gatk/documentation/article.php?id=4669", under the example section of GATK command lines, the article suggests providing the exome targets via the -L argument for exome sequencing.

I have the following queries, as per the aforementioned article:
1) Should I provide the exome target list (-L argument) only while calling a regular VCF file using HaplotypeCaller?
or
2) Should I provide the exome target list (-L argument) even while calling a gVCF file using HaplotypeCaller?

Best Answer

Answers

  • Nanda (Canada; Member)
    edited April 2017

    Thanks, Geraldine VdAuwera. Sure, I will subset the VCF records I got from the GenotypeGVCFs tool.

    I have another query regarding this post: gatkforums.broadinstitute.org/gatk/discussion/9396/i-have-50-exome-samples-belong-to-25-families-do-i-run-genotypevcfs-on-familywise-or-50-together/p1?new=1. These 50 exome samples were sequenced in different batches; some of the batches are even 2 years old. For all 50 exome samples, can I use the latest exome interval list from Illumina, or should I use the exome interval list that was used at sequencing time?

    Issue filed on GitHub by Sheila: #1966 (state: closed; closed by chandrans)
  • Sheila (Broad Institute; Broadie, Moderator, admin)

    @Nanda
    Hi,

    You should use the intersection of the interval lists, because using the union may decrease statistical power. For example, if you only have information for some samples and not others at a site, you cannot be confident how important that site really is.

    We recommend the intersection because your final VCF will then have data for all samples. With the union, if a site looks important but you only have information for some samples, you will have to go back and re-sequence the samples with missing information. Ultimately, it is up to you to determine which is better for your purposes.

    -Sheila
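To make that concrete, here is a toy sketch of what "taking the intersection" means at the interval level (hypothetical file names, one interval per chromosome; for real capture designs a dedicated tool such as `bedtools intersect -a designA.bed -b designB.bed` handles multi-interval BED files):

```shell
# Toy sketch: intersect two capture designs that each hold one interval
# per chromosome. Real BED files should go through e.g. bedtools intersect.
printf 'chr1\t100\t300\n' > designA.bed
printf 'chr1\t200\t400\n' > designB.bed
awk 'NR==FNR { lo[$1]=$2; hi[$1]=$3; next }          # load design A
     ($1 in lo) {
         s = ($2 > lo[$1]) ? $2 : lo[$1]             # max of the two starts
         e = ($3 < hi[$1]) ? $3 : hi[$1]             # min of the two ends
         if (s < e) printf "%s\t%d\t%d\n", $1, s, e  # print overlap, if any
     }' designA.bed designB.bed > targets.intersection.bed
cat targets.intersection.bed
```

The resulting intersection BED is what would then be passed via -L.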

  • mmokrejs (Czech Republic; Member)
    edited April 2017

    @Geraldine_VdAuwera said:
    You should always use the target list argument when running HaplotypeCaller on exomes or any other targeted sequence type because it will be more efficient computationally.

    Interesting. I thought I had read the guides on GVCF properly, and hence for my exome-capture data I did not use -L to select the ranges HaplotypeCaller acts on when producing the per-sample GVCFs. The guide seemed to say the statistics are robust enough and that keeping track of even low-coverage regions is the aim of the approach. My impression was that my low-coverage regions are clearly regions with mis-mapped reads, but I did not object. My bad; now I have burned my hands (read further below).

    We even encourage you to use a target list when running on whole genomes (and we provide genome target lists) because this speeds up processing by skipping regions we know from experience are pretty much impossible to call.

    Could you please point to the concrete filenames inside the GATK bundle?

    That being said, if you have already run everything without a target list, don't panic -- you can use the results; just be sure to subset your final VCF to the target list. However, don't subset calls at the GVCF level, because that could cause technical problems. Run GenotypeGVCFs first to produce the VCF, then subset the calls.

    It is too late; I have already merged the 35 patient GVCFs into the multi-sample VCF. Here you can see the bad outcome from VQSR. The PDF files contain 21 pages despite my 35 samples, and the pages have no titles, so I have no clue what they are about or why there are just 21 pages instead of 35. The R scripts run as part of VQSR should be more thorough about including a title for each plot (https://software.broadinstitute.org/gatk/documentation/article.php?id=39). It seems the VCF lines with essentially zero counts in most samples ruined the statistics:

    chr1    868860  .       T       C       19.87   PASS    AC=2;AF=0.250;AN=8;DP=6;ExcessHet=0.3218;FS=0.000;MLEAC=2;MLEAF=0.250;MQ=44.10;QD=9.93;SOR=0.693;VQSLOD=6.13;culprit=FS GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0       0/0:2,0:2:6:0,6,48      ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       0/0:1,0:1:3:0,3,26      ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       0/0:1,0:1:3:0,3,20      ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       1/1:0,2:2:6:49,6,0      ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0
    

    Now I will go back and add hard filters based on coverage, and/or use -L to restrict HaplotypeCaller to the ranges used for target enrichment (Agilent human exome V5, which would be a bit narrower than the ranges that are actually well covered).
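As a sketch of that plan: rather than inside HaplotypeCaller itself, a coverage-based hard filter can be applied to the genotyped VCF with VariantFiltration (the DP threshold and output file name here are purely illustrative, not a recommendation):

```
java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R hs38DH.fa \
    -V mysamples.raw.vcf \
    --filterExpression "DP < 10" \
    --filterName "LowCoverage" \
    -o mysamples.filtered.vcf
```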

  • Sheila (Broad Institute; Broadie, Moderator, admin)

    @mmokrejs
    Hi,

    You can still subset the final VCF using -L. Have you tried that? Do the results of VQSR look better?

    -Sheila
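For the record, one way to do that subsetting in GATK 3 is SelectVariants with the target list passed via -L (a sketch using the file names from this thread; the output name is a placeholder):

```
java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R hs38DH.fa \
    -V mysamples.raw.vcf \
    -L S04380110_Covered.bed \
    -o mysamples.raw.targets.vcf
```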

  • mmokrejs (Czech Republic; Member)
    edited April 2017

    Yes, the processing just finished, but I do not really have better results.

    Here is the command line recorded in the GVCF I generated for a single sample:

    ##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Thu Apr 20 20:07:49 CEST 2017",Epoch=1492711669348,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[mysample-42_42-PB.bwa.postalt.removed_duplicates.realignedtogether.BQSR.namesorted.fixmate.calmd.sorted.bam] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=[S04380110__Agilent_SureSelect_Human_All_Exon_V5/hg38/S04380110_Covered.bed] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=100 reference_sequence=ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK 
variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name=dbsnp source=ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/00-All.vcf.gz) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[StrandBiasBySample] excludeAnnotation=[ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] group=[StandardAnnotation, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF bamOutput=null bamWriterType=CALLED_HAPLOTYPES emitDroppedReads=false disableOptimizations=false annotateNDA=false useNewAFCalculator=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 heterozygosity_stdev=0.01 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 max_genotype_count=1024 max_num_PL_values=100 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 
GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=false keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxReadsInMemoryPerSample=30000 maxTotalReadsInMemory=10000000 maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
    

    Here I joint-genotyped the 35 exomes into a single multi-sample VCF; note that this time I even used --useNewAFCalculator:

    GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 16 -R ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa --dbsnp ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/00-All.vcf.gz --useNewAFCalculator -o mysamples.raw.vcf --variant $sample1.g.vcf --variant $sample2.g.vcf ...
    
    INFO  11:21:26,420 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  11:21:28,104 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
    WARN  11:21:36,056 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation 
    INFO  11:21:36,064 MicroScheduler - Running the GATK in parallel mode with 16 total threads, 1 CPU thread(s) for each of 16 data thread(s), of 16 processors available on this machine 
    INFO  11:21:37,614 GenomeAnalysisEngine - Preparing for traversal 
    INFO  11:21:37,621 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  11:21:37,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  11:21:37,622 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  11:21:37,622 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    WARN  11:21:38,105 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
    WARN  11:21:38,106 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
    INFO  11:21:38,106 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
    WARN  11:21:40,969 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs 
    INFO  11:22:07,632 ProgressMeter -   chr1:15494039    189106.0    30.0 s       2.6 m        0.5%   103.8 m     103.3 m 
    INFO  11:22:37,682 ProgressMeter -   chr1:34197349   2357356.0    60.0 s      25.0 s        1.1%    94.1 m      93.1 m 
    INFO  11:23:07,696 ProgressMeter -  chr1:125169834   9959165.0    90.0 s       9.0 s        3.9%    38.6 m      37.1 m 
    INFO  11:23:37,697 ProgressMeter -  chr1:210999841   1.6488893E7   120.0 s       7.0 s        6.6%    30.5 m      28.5 m 
    INFO  11:24:07,714 ProgressMeter -   chr2:73002681   2.539001E7     2.5 m       5.0 s       10.0%    25.0 m      22.5 m 
    INFO  11:24:37,715 ProgressMeter -  chr2:176999313   3.3869221E7     3.0 m       5.0 s       13.2%    22.7 m      19.7 m 
    INFO  11:25:07,716 ProgressMeter -   chr3:46266160   4.2509927E7     3.5 m       4.0 s       16.7%    21.0 m      17.5 m 
    INFO  11:25:37,719 ProgressMeter -  chr3:147298951   5.0672524E7     4.0 m       4.0 s       19.8%    20.2 m      16.2 m 
    INFO  11:26:07,721 ProgressMeter -   chr4:30130257   5.7461949E7     4.5 m       4.0 s       22.4%    20.1 m      15.6 m 
    INFO  11:26:37,722 ProgressMeter -  chr4:149066900   6.7334954E7     5.0 m       4.0 s       26.1%    19.2 m      14.2 m 
    INFO  11:27:07,723 ProgressMeter -  chr5:104999922   7.6957794E7     5.5 m       4.0 s       30.6%    18.0 m      12.5 m 
    INFO  11:27:37,758 ProgressMeter -   chr6:17998488   8.531772E7     6.0 m       4.0 s       33.5%    17.9 m      11.9 m 
    INFO  11:28:07,759 ProgressMeter -  chr6:123999834   9.4004119E7     6.5 m       4.0 s       36.8%    17.6 m      11.1 m 
    INFO  11:28:37,761 ProgressMeter -   chr7:51998993   1.01637556E8     7.0 m       4.0 s       39.9%    17.5 m      10.5 m 
    INFO  11:29:07,762 ProgressMeter -    chr8:7055327   1.0985578E8     7.5 m       4.0 s       43.5%    17.3 m       9.8 m 
    INFO  11:29:37,764 ProgressMeter -   chr8:93131611   1.18074387E8     8.0 m       4.0 s       46.1%    17.3 m       9.3 m 
    INFO  11:30:07,766 ProgressMeter -  chr9:115021154   1.2797869E8     8.5 m       3.0 s       51.3%    16.6 m       8.1 m 
    INFO  11:30:37,768 ProgressMeter -  chr10:19348104   1.32098422E8     9.0 m       4.0 s       52.7%    17.1 m       8.1 m 
    INFO  11:31:07,770 ProgressMeter - chr10:121485710   1.4054994E8     9.5 m       4.0 s       55.8%    17.0 m       7.5 m 
    INFO  11:31:37,771 ProgressMeter -  chr11:69999116   1.48224067E8    10.0 m       4.0 s       58.4%    17.1 m       7.1 m 
    INFO  11:32:07,772 ProgressMeter -  chr12:15899984   1.56204694E8    10.5 m       4.0 s       60.9%    17.2 m       6.7 m 
    INFO  11:32:37,774 ProgressMeter - chr12:119037920   1.64485809E8    11.0 m       4.0 s       64.1%    17.2 m       6.2 m 
    INFO  11:33:07,775 ProgressMeter -  chr13:86999031   1.72326288E8    11.5 m       4.0 s       67.3%    17.1 m       5.6 m 
    INFO  11:33:37,777 ProgressMeter -  chr14:96998790   1.80636815E8    12.0 m       3.0 s       71.1%    16.9 m       4.9 m 
    INFO  11:34:07,778 ProgressMeter -  chr15:90084756   1.87839344E8    12.5 m       3.0 s       74.2%    16.8 m       4.3 m 
    INFO  11:34:37,779 ProgressMeter -  chr16:76999816   1.95111601E8    13.0 m       3.0 s       77.0%    16.9 m       3.9 m 
    INFO  11:35:07,784 ProgressMeter -  chr17:48051036   2.01614335E8    13.5 m       4.0 s       78.9%    17.1 m       3.6 m 
    INFO  11:35:37,787 ProgressMeter -  chr18:54280883   2.09652692E8    14.0 m       4.0 s       81.7%    17.1 m       3.1 m 
    INFO  11:36:07,788 ProgressMeter -  chr19:24896988   2.15700809E8    14.5 m       4.0 s       83.3%    17.4 m       2.9 m 
    INFO  11:36:37,790 ProgressMeter -  chr20:21126038   2.21494393E8    15.0 m       4.0 s       85.0%    17.7 m       2.7 m 
    INFO  11:37:07,792 ProgressMeter -  chr22:26292535   2.29245708E8    15.5 m       4.0 s       88.6%    17.5 m     119.0 s 
    INFO  11:37:37,795 ProgressMeter -   chrX:65486906   2.36493162E8    16.0 m       4.0 s       91.4%    17.5 m      90.0 s 
    INFO  11:38:07,796 ProgressMeter -    chrY:8880673   2.4253789E8    16.5 m       4.0 s       94.5%    17.5 m      57.0 s 
    INFO  11:38:37,797 ProgressMeter - chr19_KI270938v1_alt:275871   2.43173867E8    17.0 m       4.0 s       99.7%    17.0 m       2.0 s 
    INFO  11:39:07,799 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    17.5 m       4.0 s       99.7%    17.5 m       2.0 s 
    INFO  11:39:37,801 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    18.0 m       4.0 s       99.7%    18.0 m       2.0 s 
    INFO  11:40:07,803 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    18.5 m       4.0 s       99.7%    18.6 m       3.0 s 
    INFO  11:40:37,804 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    19.0 m       4.0 s       99.7%    19.1 m       3.0 s 
    INFO  11:41:07,806 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    19.5 m       4.0 s       99.7%    19.6 m       3.0 s 
    INFO  11:41:37,808 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    20.0 m       4.0 s       99.7%    20.1 m       3.0 s 
    INFO  11:42:07,810 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    20.5 m       5.0 s       99.7%    20.6 m       3.0 s 
    INFO  11:42:37,812 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    21.0 m       5.0 s       99.7%    21.1 m       3.0 s 
    INFO  11:43:07,813 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    21.5 m       5.0 s       99.7%    21.6 m       3.0 s 
    INFO  11:43:37,814 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    22.0 m       5.0 s       99.7%    22.1 m       3.0 s 
    INFO  11:44:07,815 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    22.5 m       5.0 s       99.7%    22.6 m       3.0 s 
    INFO  11:44:37,816 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    23.0 m       5.0 s       99.7%    23.1 m       3.0 s 
    INFO  11:45:07,817 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    23.5 m       5.0 s       99.7%    23.6 m       3.0 s 
    INFO  11:45:37,826 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    24.0 m       5.0 s       99.7%    24.1 m       3.0 s 
    INFO  11:46:07,828 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    24.5 m       6.0 s       99.7%    24.6 m       4.0 s 
    INFO  11:46:37,830 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    25.0 m       6.0 s       99.7%    25.1 m       4.0 s 
    INFO  11:47:07,831 ProgressMeter - chr19_KI270938v1_alt:275871   2.43314032E8    25.5 m       6.0 s       99.7%    25.6 m       4.0 s 
    INFO  11:47:09,294 ProgressMeter -            done   2.43460127E8    25.5 m       6.0 s       99.7%    25.6 m       4.0 s 
    INFO  11:47:09,294 ProgressMeter - Total runtime 1531.67 secs, 25.53 min, 0.43 hours 
    ------------------------------------------------------------------------------------------
    Done. There were 4 WARN messages, the first 4 are repeated below.
    WARN  11:21:36,056 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:21:38,105 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
    WARN  11:21:38,106 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
    WARN  11:21:40,969 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs 
    ------------------------------------------------------------------------------------------
    

    Here I created the model for VQSR.

    GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T VariantRecalibrator -nt 16 -R ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa --input mysamples.raw.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ftp.broadinstitute.org/bundle/hg38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=false,prior=12.0 ftp.broadinstitute.org/bundle/hg38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/00-All.vcf.gz -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -MQCap 70 -an InbreedingCoeff --target_titv 3.2 -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
    
    INFO  11:47:14,381 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  11:47:15,819 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
    WARN  11:47:17,288 IndexDictionaryUtils - Track hapmap doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,288 IndexDictionaryUtils - Track omni doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,289 IndexDictionaryUtils - Track 1000G doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,289 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation 
    INFO  11:47:17,297 MicroScheduler - Running the GATK in parallel mode with 16 total threads, 1 CPU thread(s) for each of 16 data thread(s), of 16 processors available on this machine 
    INFO  11:47:18,733 GenomeAnalysisEngine - Preparing for traversal 
    INFO  11:47:18,741 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  11:47:18,741 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  11:47:18,741 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  11:47:18,742 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  11:47:18,749 TrainingSet - Found hapmap track:    Known = false   Training = true         Truth = true    Prior = Q15.0 
    INFO  11:47:18,749 TrainingSet - Found omni track:      Known = false   Training = true         Truth = false   Prior = Q12.0 
    INFO  11:47:18,749 TrainingSet - Found 1000G track:     Known = false   Training = true         Truth = false   Prior = Q10.0 
    INFO  11:47:18,750 TrainingSet - Found dbsnp track:     Known = true    Training = false        Truth = false   Prior = Q2.0 
    INFO  11:47:48,745 ProgressMeter -   chr1:80998348   4753246.0    30.0 s       6.0 s        2.5%    19.9 m      19.4 m 
    INFO  11:48:18,746 ProgressMeter -   chr2:17998995   1.404037E7    60.0 s       4.0 s        8.3%    12.1 m      11.1 m 
    INFO  11:48:48,747 ProgressMeter -  chr2:185998414   2.3584505E7    90.0 s       3.0 s       13.5%    11.1 m       9.6 m 
    INFO  11:49:18,748 ProgressMeter -  chr3:108999457   3.3217497E7   120.0 s       3.0 s       18.7%    10.7 m       8.7 m 
    INFO  11:49:48,750 ProgressMeter -   chr4:88998852   4.3616064E7     2.5 m       3.0 s       24.2%    10.3 m       7.8 m 
    INFO  11:50:18,751 ProgressMeter -   chr5:66999691   5.3086592E7     3.0 m       3.0 s       29.4%    10.2 m       7.2 m 
    INFO  11:50:48,752 ProgressMeter -   chr6:58036395   6.2324218E7     3.5 m       3.0 s       34.8%    10.1 m       6.6 m 
    INFO  11:51:18,753 ProgressMeter -   chr7:55394645   7.2100905E7     4.0 m       3.0 s       40.0%    10.0 m       6.0 m 
    INFO  11:51:48,757 ProgressMeter -   chr8:72999280   8.3432657E7     4.5 m       3.0 s       45.5%     9.9 m       5.4 m 
    INFO  11:52:18,758 ProgressMeter -  chr9:108999126   9.279247E7     5.0 m       3.0 s       51.1%     9.8 m       4.8 m 
    INFO  11:52:48,760 ProgressMeter -   chr11:1999326   1.025952E8     5.5 m       3.0 s       56.3%     9.8 m       4.3 m 
    INFO  11:53:18,761 ProgressMeter -  chr12:43998134   1.13018674E8     6.0 m       3.0 s       61.8%     9.7 m       3.7 m 
    INFO  11:53:48,762 ProgressMeter -  chr13:78998546   1.2172114E8     6.5 m       3.0 s       67.0%     9.7 m       3.2 m 
    INFO  11:54:18,764 ProgressMeter -  chr15:67227481   1.30914593E8     7.0 m       3.0 s       73.5%     9.5 m       2.5 m 
    INFO  11:54:48,765 ProgressMeter -  chr17:36999795   1.41449746E8     7.5 m       3.0 s       78.6%     9.5 m       2.0 m 
    INFO  11:55:18,767 ProgressMeter -  chr19:53999801   1.52365069E8     8.0 m       3.0 s       84.2%     9.5 m      90.0 s 
    INFO  11:55:48,769 ProgressMeter -   chrX:15998059   1.61757085E8     8.5 m       3.0 s       89.9%     9.5 m      57.0 s 
    INFO  11:56:18,770 ProgressMeter -   chrY:26672530   1.67465903E8     9.0 m       3.0 s       95.0%     9.5 m      28.0 s 
    INFO  11:56:48,772 ProgressMeter - chr2_GL383522v1_alt:35618   1.67490576E8     9.5 m       3.0 s       96.5%     9.8 m      20.0 s 
    INFO  11:57:18,773 ProgressMeter - chr17_KI270909v1_alt:224256   1.67493678E8    10.0 m       3.0 s       98.5%    10.1 m       8.0 s 
    INFO  11:57:48,774 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    10.5 m       3.0 s       99.7%    10.5 m       1.0 s 
    INFO  11:58:18,776 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    11.0 m       3.0 s       99.7%    11.0 m       1.0 s 
    INFO  11:58:48,777 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    11.5 m       4.0 s       99.7%    11.5 m       1.0 s 
    INFO  11:59:18,778 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    12.0 m       4.0 s       99.7%    12.0 m       2.0 s 
    INFO  11:59:48,782 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    12.5 m       4.0 s       99.7%    12.5 m       2.0 s 
    INFO  12:00:18,783 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    13.0 m       4.0 s       99.7%    13.0 m       2.0 s 
    INFO  12:00:48,785 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    13.5 m       4.0 s       99.7%    13.5 m       2.0 s 
    INFO  12:01:18,786 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    14.0 m       5.0 s       99.7%    14.0 m       2.0 s 
    INFO  12:01:48,787 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    14.5 m       5.0 s       99.7%    14.5 m       2.0 s 
    INFO  12:02:18,788 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    15.0 m       5.0 s       99.7%    15.0 m       2.0 s 
    INFO  12:02:48,789 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    15.5 m       5.0 s       99.7%    15.5 m       2.0 s 
    INFO  12:03:18,790 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    16.0 m       5.0 s       99.7%    16.0 m       2.0 s 
    INFO  12:03:48,791 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    16.5 m       5.0 s       99.7%    16.5 m       2.0 s 
    INFO  12:04:15,941 VariantDataManager - QD:      mean = 16.98    standard deviation = 7.37 
    INFO  12:04:16,010 VariantDataManager - FS:      mean = 1.63     standard deviation = 3.58 
    INFO  12:04:16,057 VariantDataManager - SOR:     mean = 1.14     standard deviation = 1.18 
    INFO  12:04:16,120 VariantDataManager - MQ:      mean = 1.68     standard deviation = 0.36 
    INFO  12:04:16,170 VariantDataManager - MQRankSum:       mean = -0.04    standard deviation = 0.48 
    INFO  12:04:16,221 VariantDataManager - ReadPosRankSum:          mean = 0.12     standard deviation = 0.68 
    INFO  12:04:16,272 VariantDataManager - InbreedingCoeff:         mean = 0.01     standard deviation = 0.20 
    INFO  12:04:16,614 VariantDataManager - Annotations are now ordered by their information content: [QD, MQ, SOR, FS, InbreedingCoeff, ReadPosRankSum, MQRankSum] 
    INFO  12:04:16,641 VariantDataManager - Training with 174384 variants after standard deviation thresholding. 
    INFO  12:04:16,645 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
    INFO  12:04:18,792 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    17.0 m       6.0 s       99.7%    17.0 m       2.0 s 
    INFO  12:04:27,914 VariantRecalibratorEngine - Finished iteration 0. 
    INFO  12:04:33,306 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.66335 
    INFO  12:04:38,774 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.25132 
    INFO  12:04:44,515 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.10154 
    INFO  12:04:48,793 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    17.5 m       6.0 s       99.7%    17.5 m       2.0 s 
    INFO  12:04:50,358 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.02245 
    INFO  12:04:56,186 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.04004 
    INFO  12:05:02,001 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.06854 
    INFO  12:05:07,826 VariantRecalibratorEngine - Finished iteration 35.   Current change in mixture coefficients = 0.05058 
    INFO  12:05:13,634 VariantRecalibratorEngine - Finished iteration 40.   Current change in mixture coefficients = 0.01001 
    INFO  12:05:18,794 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    18.0 m       6.0 s       99.7%    18.1 m       3.0 s 
    INFO  12:05:19,437 VariantRecalibratorEngine - Finished iteration 45.   Current change in mixture coefficients = 0.00519 
    INFO  12:05:25,235 VariantRecalibratorEngine - Finished iteration 50.   Current change in mixture coefficients = 0.00176 
    INFO  12:05:25,236 VariantRecalibratorEngine - Convergence after 50 iterations! 
    INFO  12:05:26,006 VariantRecalibratorEngine - Evaluating full set of 516353 variants... 
    INFO  12:05:32,847 VariantDataManager - Training with worst 17878 scoring variants --> variants with LOD <= -5.0000. 
    INFO  12:05:32,848 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
    INFO  12:05:33,031 VariantRecalibratorEngine - Finished iteration 0. 
    INFO  12:05:33,113 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.00753 
    INFO  12:05:33,192 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.01562 
    INFO  12:05:33,270 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.01283 
    INFO  12:05:33,349 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.01727 
    INFO  12:05:33,428 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.01716 
    INFO  12:05:33,507 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.01228 
    INFO  12:05:33,588 VariantRecalibratorEngine - Finished iteration 35.   Current change in mixture coefficients = 0.00599 
    INFO  12:05:33,669 VariantRecalibratorEngine - Finished iteration 40.   Current change in mixture coefficients = 0.00223 
    INFO  12:05:33,686 VariantRecalibratorEngine - Convergence after 41 iterations! 
    INFO  12:05:33,714 VariantRecalibratorEngine - Evaluating full set of 516353 variants... 
    INFO  12:05:41,145 TrancheManager - Finding 4 tranches for 516353 variants 
    INFO  12:05:41,460 TrancheManager -   Tranche threshold 100.00 => selection metric threshold 0.000 
    INFO  12:05:41,533 TrancheManager -   Found tranche for 100.000: 0.000 threshold starting with variant 0; running score is 0.000  
    INFO  12:05:41,533 TrancheManager -   Tranche is Tranche ts=100.00 minVQSLod=-39768.4476 known=(283569 @ 1.1998) novel=(232784 @ 0.0791) truthSites(85270 accessible, 85270 called), name=anonymous] 
    INFO  12:05:41,534 TrancheManager -   Tranche threshold 99.90 => selection metric threshold 0.001 
    INFO  12:05:41,605 TrancheManager -   Found tranche for 99.900: 0.001 threshold starting with variant 9680; running score is 0.001  
    INFO  12:05:41,605 TrancheManager -   Tranche is Tranche ts=99.90 minVQSLod=-2.1611 known=(276686 @ 1.2012) novel=(229987 @ 0.0783) truthSites(85270 accessible, 85184 called), name=anonymous] 
    INFO  12:05:41,606 TrancheManager -   Tranche threshold 99.00 => selection metric threshold 0.010 
    INFO  12:05:41,654 TrancheManager -   Found tranche for 99.000: 0.010 threshold starting with variant 26291; running score is 0.010  
    INFO  12:05:41,655 TrancheManager -   Tranche is Tranche ts=99.00 minVQSLod=0.5630 known=(266150 @ 1.1971) novel=(223912 @ 0.0777) truthSites(85270 accessible, 84417 called), name=anonymous] 
    INFO  12:05:41,655 TrancheManager -   Tranche threshold 90.00 => selection metric threshold 0.100 
    INFO  12:05:41,705 TrancheManager -   Found tranche for 90.000: 0.100 threshold starting with variant 108351; running score is 0.100  
    INFO  12:05:41,706 TrancheManager -   Tranche is Tranche ts=90.00 minVQSLod=7.1301 known=(226459 @ 1.2340) novel=(181543 @ 0.0775) truthSites(85270 accessible, 76743 called), name=anonymous] 
    INFO  12:05:41,709 VariantRecalibrator - Writing out recalibration table... 
    INFO  12:05:45,516 VariantRecalibrator - Writing out visualization Rscript file... 
    INFO  12:05:45,580 VariantRecalibrator - Building QD x MQ plot... 
    INFO  12:05:45,585 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:46,941 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:47,363 VariantRecalibrator - Building QD x SOR plot... 
    INFO  12:05:47,366 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:48,719 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:48,795 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    18.5 m       6.0 s       99.7%    18.6 m       3.0 s 
    INFO  12:05:49,102 VariantRecalibrator - Building QD x FS plot... 
    INFO  12:05:49,103 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:50,466 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:50,829 VariantRecalibrator - Building QD x InbreedingCoeff plot... 
    INFO  12:05:50,830 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:05:52,164 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:05:52,535 VariantRecalibrator - Building QD x ReadPosRankSum plot... 
    INFO  12:05:52,536 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:53,902 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:05:54,254 VariantRecalibrator - Building QD x MQRankSum plot... 
    INFO  12:05:54,255 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:05:55,553 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:05:55,906 VariantRecalibrator - Building MQ x SOR plot... 
    INFO  12:05:55,907 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:05:57,285 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:05:57,637 VariantRecalibrator - Building MQ x FS plot... 
    INFO  12:05:57,638 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:05:59,022 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:05:59,380 VariantRecalibrator - Building MQ x InbreedingCoeff plot... 
    INFO  12:05:59,381 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:00,739 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:01,103 VariantRecalibrator - Building MQ x ReadPosRankSum plot... 
    INFO  12:06:01,103 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:02,489 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:02,846 VariantRecalibrator - Building MQ x MQRankSum plot... 
    INFO  12:06:02,847 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:04,165 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:04,518 VariantRecalibrator - Building SOR x FS plot... 
    INFO  12:06:04,519 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:05,906 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:06,259 VariantRecalibrator - Building SOR x InbreedingCoeff plot... 
    INFO  12:06:06,260 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:07,619 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:07,984 VariantRecalibrator - Building SOR x ReadPosRankSum plot... 
    INFO  12:06:07,985 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:09,367 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:09,724 VariantRecalibrator - Building SOR x MQRankSum plot... 
    INFO  12:06:09,725 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:11,043 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:11,388 VariantRecalibrator - Building FS x InbreedingCoeff plot... 
    INFO  12:06:11,388 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:12,747 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:13,112 VariantRecalibrator - Building FS x ReadPosRankSum plot... 
    INFO  12:06:13,113 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:14,498 VariantRecalibratorEngine - Evaluating full set of 3721 variants... 
    INFO  12:06:14,866 VariantRecalibrator - Building FS x MQRankSum plot... 
    INFO  12:06:14,867 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:16,183 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:16,529 VariantRecalibrator - Building InbreedingCoeff x ReadPosRankSum plot... 
    INFO  12:06:16,530 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:17,893 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:18,256 VariantRecalibrator - Building InbreedingCoeff x MQRankSum plot... 
    INFO  12:06:18,256 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:06:18,796 ProgressMeter - chr19_KI270938v1_alt:195720   1.67493681E8    19.0 m       6.0 s       99.7%    19.1 m       3.0 s 
    INFO  12:06:19,554 VariantRecalibratorEngine - Evaluating full set of 3600 variants... 
    INFO  12:06:19,912 VariantRecalibrator - Building ReadPosRankSum x MQRankSum plot... 
    INFO  12:06:19,912 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:21,235 VariantRecalibratorEngine - Evaluating full set of 3660 variants... 
    INFO  12:06:21,586 VariantRecalibrator - Executing: Rscript recalibrate_SNP_plots.R 
    INFO  12:06:46,699 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/gatk/tools/walkers/variantrecalibration/plot_Tranches.R recalibrate_SNP.tranches 3.2 
    INFO  12:06:47,120 ProgressMeter -            done   1.67493763E8    19.5 m       6.0 s       99.7%    19.5 m       3.0 s 
    INFO  12:06:47,121 ProgressMeter - Total runtime 1168.38 secs, 19.47 min, 0.32 hours 
    ------------------------------------------------------------------------------------------
    Done. There were 4 WARN messages, the first 4 are repeated below.
    WARN  11:47:17,288 IndexDictionaryUtils - Track hapmap doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,288 IndexDictionaryUtils - Track omni doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,289 IndexDictionaryUtils - Track 1000G doesn't have a sequence dictionary built in, skipping dictionary validation 
    WARN  11:47:17,289 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation 
    ------------------------------------------------------------------------------------------
    

    Here I applied the model.

    GenomeAnalysisTK.jar -T ApplyRecalibration -R ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa --input mysamples.raw.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o mysamples.recalibrated_snps_raw_indels.vcf
    INFO  12:06:50,949 HelpFormatter - -------------------------------------------------------------------------------------------- 
    INFO  12:06:50,951 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
    INFO  12:06:50,951 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
    INFO  12:06:50,951 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
    INFO  12:06:50,951 HelpFormatter - [Wed Apr 26 12:06:50 CEST 2017] Executing on Linux 2.6.32-642.15.1.el6.Bull.110.x86_64 amd64 
    INFO  12:06:50,952 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13 
    INFO  12:06:50,955 HelpFormatter - Program Args: -T ApplyRecalibration -R ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa --input mysamples.raw.vcf -mode SNP --ts_filter_level 99.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf 
    INFO  12:06:50,959 HelpFormatter - Executing as [email protected] on Linux 2.6.32-642.15.1.el6.Bull.110.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13. 
    INFO  12:06:50,959 HelpFormatter - Date/Time: 2017/04/26 12:06:50 
    INFO  12:06:50,959 HelpFormatter - -------------------------------------------------------------------------------------------- 
    INFO  12:06:50,959 HelpFormatter - -------------------------------------------------------------------------------------------- 
    INFO  12:06:50,978 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  12:06:52,408 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
    INFO  12:06:54,472 GenomeAnalysisEngine - Preparing for traversal 
    INFO  12:06:54,479 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  12:06:54,479 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  12:06:54,480 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
    INFO  12:06:54,480 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
    INFO  12:06:54,484 ApplyRecalibration - Read tranche Tranche ts=90.00 minVQSLod=7.1301 known=(226459 @ 1.2340) novel=(181543 @ 0.0775) truthSites(85270 accessible, 76743 called), name=VQSRTrancheSNP0.00to90.00] 
    INFO  12:06:54,485 ApplyRecalibration - Read tranche Tranche ts=99.00 minVQSLod=0.5630 known=(266150 @ 1.1971) novel=(223912 @ 0.0777) truthSites(85270 accessible, 84417 called), name=VQSRTrancheSNP90.00to99.00] 
    INFO  12:06:54,485 ApplyRecalibration - Read tranche Tranche ts=99.90 minVQSLod=-2.1611 known=(276686 @ 1.2012) novel=(229987 @ 0.0783) truthSites(85270 accessible, 85184 called), name=VQSRTrancheSNP99.00to99.90] 
    INFO  12:06:54,486 ApplyRecalibration - Read tranche Tranche ts=100.00 minVQSLod=-39768.4476 known=(283569 @ 1.1998) novel=(232784 @ 0.0791) truthSites(85270 accessible, 85270 called), name=VQSRTrancheSNP99.90to100.00] 
    INFO  12:06:54,519 ApplyRecalibration - Keeping all variants in tranche Tranche ts=99.00 minVQSLod=0.5630 known=(266150 @ 1.1971) novel=(223912 @ 0.0777) truthSites(85270 accessible, 84417 called), name=VQSRTrancheSNP90.00to99.00] 
    INFO  12:07:24,483 ProgressMeter -    chrX:2781927    596978.0    30.0 s      50.0 s       89.4%    33.0 s       3.0 s 
    INFO  12:07:26,138 ProgressMeter -            done    615613.0    31.0 s      51.0 s       99.7%    31.0 s       0.0 s 
    INFO  12:07:26,139 ProgressMeter - Total runtime 31.66 secs, 0.53 min, 0.01 hours 
    ------------------------------------------------------------------------------------------
    Done. There were no warn messages.
    ------------------------------------------------------------------------------------------
    

    I am not showing the INDEL VQSR step but I assume it is similar to the above.
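    For reference, the indel pass follows the same two-step pattern as the SNP pass above. A sketch with placeholder resource and file names (these are not the poster's actual paths; the resource choices follow the GATK 3 VQSR documentation, which recommends the Mills indel resource, fewer Gaussians, and dropping MQ from the annotation list for indels):

```shell
# Sketch only: mirrors the SNP recalibration above with -mode INDEL.
# Reference, resource, and sample file names are placeholders.
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R hs38DH.fa \
    -input mysamples.recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
    -mode INDEL --maxGaussians 4 \
    -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches

java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R hs38DH.fa \
    -input mysamples.recalibrated_snps_raw_indels.vcf \
    -mode INDEL --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches \
    -o mysamples.recalibrated_variants.vcf
```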

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mmokrejs
    Hi,

    I have one more thing to try. Can you set -resource:omni,known=false,training=true,truth=true,prior=12.0 instead of -resource:omni,known=false,training=true,truth=false,prior=12.0?

    Thanks,
    Sheila

  • mmokrejsmmokrejs Czech RepublicMember

    @Sheila
    That did not help either.

    Can I use https://www.illumina.com/platinumgenomes.html, namely ussd-ftp.illumina.com/2016-1.0/hg38/small_variants/NA12877/NA12877.vcf and ussd-ftp.illumina.com/2016-1.0/hg38/small_variants/NA12878/NA12878.vcf? From reading the VQSR docs I am not certain which annotations are necessary in the VCF file for VariantRecalibrator. Just the sites?

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mmokrejs
    Hi,

    I have one other thing for you to try. I will also ask my team for other suggestions. In this thread the user has the opposite problem, but her issue was solved by setting the TiTv ratio in her ApplyRecalibration command. Perhaps you can try setting --target_titv 3.

    -Sheila

  • mmokrejsmmokrejs Czech RepublicMember

    Hi @Sheila

    her issue was solved by setting the TiTv ratio in her ApplyRecalibration command. Perhaps you can try setting --target_titv 3

    No, I do not see any large difference in the PDF plots after the change from 3.2 to 3.0.

    Issue · GitHub by Sheila: #2040, closed by chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mmokrejs
    Hi,

    I am checking with the team. We will get back to you soon.

    -Sheila

  • mmokrejsmmokrejs Czech RepublicMember

    BTW, from a quick reading, the supplementary material of http://science.sciencemag.org/content/early/2016/03/02/science.aac8624 mentions on page 5 a Ti/Tv ratio of about 2.5 in the same Agilent Human Exon v5 data.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited May 2017

    @mmokrejs
    Hi,

    Does setting the TiTv ratio to 2.5 help with the plots?

    -Sheila

  • @Nanda said:
    Hi,

    Recently we performed exome sequencing using Nextera Illumina platform for three samples (Father, Mother and Son). I downloaded the exome interval list from Illumina's website.

    1) Trimmed the raw reads
    2) Aligned the trimmed reads against the human reference hg19 as recommended for exome-sequencing

    Hello, could someone please direct me to a reference that compares or states this? I have searched for comparisons regarding which reference genome to use for mapping and have not found any definitive answers like the one Nanda stated. Also, maybe things have changed since 2012? I am also mapping exome reads (SureSelect V5 and V6). Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Frank_cumc
    Hi,

    I am not sure about any paper @Nanda may have been referencing, but have a look at this article that may help.

    -Sheila

  • RosmaninhoRosmaninho Member

    Hello, I've also used hg19 to align my reads and am now doing haplotype calling for 9 samples plus 1000 Genomes processed BAMs to reach the 30-sample threshold.
    Which interval-list file should I use with the -L parameter in HaplotypeCaller?
    I didn't understand whether this is a requirement as well.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited May 24

    Hi @Rosmaninho,

    The -L parameter is optional with HaplotypeCaller; use it, for example, if you want to quickly check the calls for a small region of the genome, or if you are scattering jobs over the genome (e.g. per chromosome). If you are calling on targeted exomes, then you can always subset variants to your on-target intervals of interest after the calling.

    P.S. Actually, for hg19, I believe there are regions that amass reads in a manner that is problematic for most callers. These types of problem regions have been ironed out in GRCh38. If you are stuck using hg19, then you should try to find 'calling regions' intervals that omit these problematic regions. In GRCh38, you should be able to call on the entirety of the reference (sans N regions of course) without issue.
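    To illustrate the "subset after calling" idea: GATK's SelectVariants with -L is the proper tool for this, but as a toy sketch of what on-target subsetting means (all coordinates below are made up, and fields are whitespace-separated here for readability, whereas real BED/VCF files are tab-delimited):

```shell
# Toy on-target subsetting sketch; GATK SelectVariants -L is the real tool.
cat > targets.bed <<'EOF'
chr1 100 200
chr1 500 600
EOF
cat > calls.vcf <<'EOF'
chr1 150 . A G
chr1 300 . C T
chr1 550 . G A
EOF
# BED is 0-based half-open, VCF POS is 1-based: keep records with
# start < POS <= end (contig matching omitted for brevity).
awk 'NR==FNR {start[NR]=$2; end[NR]=$3; n=NR; next}
     {for (i=1; i<=n; i++)
        if ($2 > start[i] && $2 <= end[i]) {print; break}}' \
    targets.bed calls.vcf
# -> keeps the chr1:150 and chr1:550 records; chr1:300 is off-target
```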

  • RosmaninhoRosmaninho Member

    From what I've understood, the -L parameter is recommended for WES data.
    Since the extracted DNA from my samples was hybridized with the Roche SeqCap EZ library, I fetched the SeqCapEZ_Exome_v3.0_Design_Annotation_files (http://sequencing.roche.com/en/products-solutions/by-category/target-enrichment/hybridization/seqcap-ez-exome-v3-kit.html) and used SeqCap_EZ_Exome_v3_hg19_primary_targets.bed in the -L parameter.

    This file contains the design primary target (unpadded) in hg19 coordinates and gene annotation in the 4th column.

    Since I only have 9 WES samples, I also ran the Haplotypecaller for 21 bam files with a similar geographic background to my samples, from the 1000Genomes, realigned to hg19. I also ran against the same bed file that I ran for my 9 samples.

    Do you think this is the correct approach?
    Should I run the 1000Genomes samples against the bed file that was referenced in the published paper?

    Furthermore, since we are considering adding extra samples to my analysis, we are in the decision process of performing WGS or WES in those extra samples. If we pick WGS is it possible to compare the results from our WES samples to the WGS samples?

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    @Rosmaninho,

    Let me summarize your scenario. You want to call short variants on a cohort consisting of:

    • 9 targeted exome samples from capture kit A and
    • 21 exome samples from the 1000 Genomes Project that used capture kit B

    And use an intervals list with HaplotypeCaller consisting of:

    • unpadded targets from capture kit A

    Depending on your research aims, you may want to consider comparing the capture regions of kit A and B, as I do for the Hammer CNV tutorial, e.g.:

    and think about the implications of the extent of overlap for your analysis. For example, is filling out the squared-off sample-by-site matrix that HaplotypeCaller's ERC mode makes possible more important than getting every call, even if some samples lack coverage at some of those calls? Alternatively, to make the most of HaplotypeCaller compute, you might consider calling each sample in ERC GVCF mode over the entirety of callable regions without applying the targets list, then applying your intervals list of choice at the GenotypeGVCFs step. That way you could test multiple intervals lists on the GVCFs with GenotypeGVCFs and would have saved on the upfront compute.
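    Sketched as GATK 3 command lines (file names here are placeholders, not from this thread), that alternative would look something like:

```shell
# Sketch only; reference, sample, and intervals file names are placeholders.
# 1) Per-sample calling over all callable regions, no -L:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fa -I sampleN.bam --emitRefConfidence GVCF -o sampleN.g.vcf
# 2) Joint genotyping restricted to the intervals of interest:
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fa -L capture_targets.interval_list \
    --variant sample1.g.vcf --variant sample2.g.vcf -o cohort.vcf
```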

    I'm not all that familiar with the recommended practices for germline calling in hg19 for targeted exomes as I've only written about GRCh38 WGS calling and used exomes for CNV analysis. Perhaps @Sheila can add to this discussion.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Rosmaninho @shlee
    Hi,

    I think this thread will help as well.

    -Sheila

  • @shlee said:
    Let me summarize your scenario. You want to call short variants on a cohort consisting of: […] Perhaps @Sheila can add to this discussion.

    Well, since I was having some serious issues in my QC using the hg19 annotation, I re-did everything with b37. The 1000 Genomes samples are in b37 coordinates and I thought it would be better to do it this way.

    I converted the SeqCap_EZ_Exome_v3_hg19_capture_targets.bed file and the exome targets from the 1000 Genomes Project to b37 coordinates and merged them using bedops. I am now running HaplotypeCaller with this BED file as the interval list and a padding of 100 bp.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hope that goes well @Rosmaninho. Thanks for the update.

  • No, thank you for your availability to help!
    I'm new to this so I expect to bang against the wall a few times. :)

  • Well, I finished everything and ran Picard GenotypeConcordance, but I still can't get concordance above 0.4 compared to NA12878.knowledgebase.snapshot.20131119.b37.vcf.
    Comparing against the NA12878 VCF generated by me, the concordance is 0.86. I don't know what I'm missing.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hi @Rosmaninho,

    I'm not familiar with the entirety of this thread but I should think you need to survey concordance based on intervals that are an intersection of your capture kit's interval list and the interval list used to generate the NA12878.knowledgebase.snapshot.20131119.b37.vcf resource. Taking a peek inside the latter:

    1       874501  .       C       T       .       .       CallSetName=AffyExomePlus_MONO,BroadExomeLOF_MONO;Confidence=0.00;Date=1384848057056;PolymorphicStatus=UNKNOWN;TruthStatus=FALSE_POSITIVE       GT      ./.
    1       874714  .       C       T       .       .       CallSetName=AffyExomePlus_MONO;Confidence=0.00;Date=1384848057058;PolymorphicStatus=UNKNOWN;TruthStatus=FALSE_POSITIVE  GT      ./.
    1       874809  .       G       C       .       .       CallSetName=AffyExomePlus_MONO;Confidence=0.00;Date=1384848057060;PolymorphicStatus=UNKNOWN;TruthStatus=FALSE_POSITIVE  GT      ./.
    1       874816  .       C       CT      .       .       CallSetName=Mills_1000G_GS_indels,largeScaleValidationPools_POLY;Confidence=0.00;Date=1384848057062;PolymorphicStatus=UNKNOWN;TruthStatus=UNKNOWN       GT      ./.
    1       874817  .       C       T       .       .       CallSetName=AffyExomePlus_MONO;Confidence=0.00;Date=1384848057064;PolymorphicStatus=UNKNOWN;TruthStatus=FALSE_POSITIVE  GT      ./.
    1       876498  .       GA      G       .       .       CallSetName=AffyExomePlus_POLY;Confidence=0.00;Date=1384848057066;PolymorphicStatus=POLYMORPHIC;TruthStatus=TRUE_POSITIVE       GT      1/1
    1       876499  .       A       G       .       .       CallSetName=1000GPilot1Liftover,NIST_GenomesInABottle;Confidence=0.00;Date=1384848057068;PolymorphicStatus=POLYMORPHIC;Tr
    

    We see this resource is some union of calls from different capture kits. I see one intervals list in the GATK Resource Bundle.

    I cannot say whether this intervals list represents the entirety of genomic intervals of the knowledgebase data but it is a good bet that it covers the high confidence regions.

  • RosmaninhoRosmaninho Member
    edited September 12

    Ok, but what is an acceptable result for GenotypeConcordance with Picard tools? What should I look for?

    Because while your documentation is very clear in terms of variant concordance, I don't understand what to look for in genotype concordance. Doc #11071 says:

    Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within a set of variant calls that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.

    What is the sample callset that I should evaluate, and what is the truth set I should compare it to? When I compare one of my called samples (let's say Exome 1) to a truth sample (e.g. NA12878), I never get a genotype concordance greater than 0.4. Is this to be expected between different individuals?
    Or should I compare a BAM file processed together with mine (e.g. NA12878 together with my samples and a bunch of other 1000 Genomes samples) to the NA12878 in the resource bundle?

    I mean, for variant concordance there are expected value ranges for germline data in Doc #11071: number of variants, TiTv ratio, and the ratio of insertions to deletions. I don't know what to look for in GenotypeConcordance. It's annoying because it's a recommended step in the Best Practices before using variants in a project, but I don't know what to expect.

    I'll just throw in the towel and ignore GenotypeConcordance in my QC of the variants I called. I've been stuck on this for a while, so I'll just move on anyway. I did everything according to the Best Practices and with all the material from the resource bundle anyway. Also, I have a pair of twins in my samples and their GenotypeConcordance is high, so I'll assume that everything is fine.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    Hi @Rosmaninho,

    If you are comparing NA12878 to NA12878, then you should expect high genotype concordance. If you are comparing different samples, then you cannot expect high genotype concordance.
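    As a toy illustration of the metric itself (Picard GenotypeConcordance is the real tool; the genotypes below are invented): genotype concordance is essentially the fraction of shared sites at which the two callsets assign the same genotype.

```shell
# Invented genotypes for two callsets (columns 2 and 3) at four shared sites.
cat > genotypes.txt <<'EOF'
site1 0/1 0/1
site2 1/1 0/1
site3 0/0 0/0
site4 0/1 1/1
EOF
# Concordance = matching genotypes / total shared sites.
awk '{total++; if ($2 == $3) same++}
     END {print same " of " total " genotypes concordant"}' genotypes.txt
# -> 2 of 4 genotypes concordant
```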

  • So, to create the interval list that I used for the haplotype calling and for analyzing GenotypeConcordance, I used bedops to merge the .bed file of the Roche SeqCap EZ Exome kit 3.0 with the Phase3-like exome BED file present on the 1000 Genomes FTP server.

    ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/exome_pull_down_targets/20130108.exome.targets.bed.README

    So is it preferable to perform an intersection instead of a merge? It seemed to me that merging was the more correct option; otherwise I could miss fragments present in my exome kit... I am also using the merge of these two BED files to analyze variant concordance.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin

    @Rosmaninho,

    I've not tried bedops. Do you like it? I mostly use bedtools for interval manipulations. I would not use a merge but rather only those regions in common between the two target lists for concordance calculations. Otherwise, your numbers may be artificially depressed. Consider the implications of comparing fragments in one kit that were not targeted by the other kit: variants in one would appear as false negatives in the other. I suppose how you go about comparing two datasets depends on your ultimate aims. Are you trying to figure out whether the sensitivity of the tool-chains is comparable, or are you trying to assess the difference in coverage between two capture kits?
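    To make the merge-versus-intersect distinction concrete (toy single-contig intervals for two hypothetical kits; bedtools intersect/merge are the standard tools):

```shell
# Toy capture-target lists for two hypothetical kits (single contig,
# whitespace-separated for readability; real BED files are tab-delimited).
cat > kitA.bed <<'EOF'
chr1 100 300
chr1 500 700
EOF
cat > kitB.bed <<'EOF'
chr1 200 400
chr1 800 900
EOF
# Intersection: only the regions both kits target, i.e. where a
# genotype comparison between the two datasets is fair.
awk 'NR==FNR {s[NR]=$2; e[NR]=$3; n=NR; next}
     {for (i=1; i<=n; i++) {
        lo = ($2 > s[i]) ? $2 : s[i]
        hi = ($3 < e[i]) ? $3 : e[i]
        if (lo < hi) print $1, lo, hi
      }}' kitA.bed kitB.bed
# -> chr1 200 300
```

    The kit-A-only and kit-B-only regions drop out of the intersection; a merge (union) would keep them, and variants there would count against concordance even though only one kit could have seen them.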

  • Well, it merges bed files so it works for me... :D
    Never tried bedtools since it's not available in the docker images of my institution and I didn't want to add redundant tools.

    Indeed my variant concordance numbers are quite low... So when you mention

    I would not use a merge but rather only those regions in common between the two target lists towards concordance calculations.

    You mean in the variant concordance calculations? How about in the HaplotypeCaller step? I decided to use the merge to adequately call the variants in the 1000 Genomes samples, but should I just restrict myself to the SeqCap EZ Exome BED coordinates?

    I am trying to call the variants in my own samples and just added 1kgenomes samples to achieve 30 samples.

    Issue · GitHub by shlee: #3156, closed by sooheelee
  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited September 14

    Happy Friday @Rosmaninho,

    Bedtools is available in the GATK Docker (docker pull broadinstitute/gatk:4.0.8.1 pulls the latest). You can invoke the tool with bedtools. In fact, there are many other useful tools available on the path in the GATK Docker besides GATK, e.g. samtools, gsutil and python. You can see the list with ls /usr/bin.

    You mean in the variant concordance calculations? How about in the HaplotypeCaller step? I decided to use the merge to adequately call the variants in the 1kgenomes samples, but should I just restrict myself to the SeqCap EZ Exome bed coordinates?

    So now you are asking about common practices with HaplotypeCaller. I believe there are a number of discussion threads in the forum about padding a cohort with publicly available samples so that there is enough data for VQSR. See this thread and this thread to start. The former states:

    If you don't have enough of your own samples, consider using publicly available data (e.g. exome bams from the 1000 Genomes Project) to "pad" your cohort. Be aware however that you cannot simply merge in someone else's variant calls. You must joint-call variants from the original BAMs with your own samples. We recommend using the GVCF workflow to generate GVCFs from the original BAMs, and joint-call those GVCFs along with your own samples' GVCFs using GenotypeGVCFs.

    Let's think about the implications of mixing different capture kits. Given that whole exome kits aim to comprehensively target all of the exons and other informative genomic regions, it is a safe bet that two whole exome kits will have similar intended genomic coverage. However, the efficiency of target capture and differences in shearing/sample processing may yield coverage profiles that look different (think of a tall narrow peak versus a wide block around a target). When you perform joint calling, regions with good coverage across samples will give good results. However, regions with differential coverage across samples (think about the edges of targets) will comparatively lack the power of the cohort, and any call from these regions will have comparatively subpar annotation values.

    If you compare the coverage across the genome for samples from the two capture kits, do the peaks that lie within the target intervals have good coverage depth? One tool that can help answer this and other relevant questions is CollectTargetedPcrMetrics. If there is a large differential, you may want to look into padding your sample(s) with samples that have more similar profiles, towards better-powered joint calling.

    I hope this is helpful.

  • @shlee said:
    Happy Friday @Rosmaninho,

    Thanks for helping me right before the weekend. :)

    Bedtools is available in the GATK Docker (docker pull broadinstitute/gatk:4.0.8.1 pulls the latest). You can invoke the tool with bedtools. In fact, there are many other useful tools available on the path in the GATK Docker besides GATK, e.g. samtools, gsutil and python. You can see the list with ls /usr/bin.

    You mean in the variant concordance calculations? How about in the HaplotypeCaller step? I decided to use the merge to adequately call the variants in the 1kgenomes samples, but should I just restrict myself to the SeqCap EZ Exome bed coordinates?

    So now you are asking about common practices with HaplotypeCaller. I believe there are a number of discussion threads in the forum about padding a cohort with publicly available samples so that there is enough data for VQSR. See this thread and this thread to start. The former states:

    I ran HaplotypeCaller with padding of 50bp and padding of 20bp around the capture targets. Padding of 20bp did improve my variant calling metrics somewhat, but my dbSNP Ti/Tv ratio is still around 2.5 (with padding of 100bp it was around 2.4). Still far from the recommended 2.8-3. :/

    I also tested different --maxGaussians (4, 6, and 8) in the variant recalibration steps but the differences weren't so drastic.

    If you don't have enough of your own samples, consider using publicly available data (e.g. exome bams from the 1000 Genomes Project) to "pad" your cohort. Be aware however that you cannot simply merge in someone else's variant calls. You must joint-call variants from the original BAMs with your own samples. We recommend using the GVCF workflow to generate GVCFs from the original BAMs, and joint-call those GVCFs along with your own samples' GVCFs using GenotypeGVCFs.

    This is what I did. I obtained the BAM files from samples with a similar geographic background (Iberian) plus NA12878, generated GVCFs, and joint-called them. I used CombineGVCFs instead of GenomicsDBImport though.

    Let's think about the implications of mixing different capture kits. Given that whole exome kits aim to comprehensively target all of the exons and other informative genomic regions, it is a safe bet that two whole exome kits will have similar intended genomic coverage. However, the efficiency of target capture and differences in shearing/sample processing may yield coverage profiles that look different (think of a tall narrow peak versus a wide block around a target). When you perform joint calling, regions with good coverage across samples will give good results. However, regions with differential coverage across samples (think about the edges of targets) will comparatively lack the power of the cohort, and any call from these regions will have comparatively subpar annotation values.

    I decided to try two extra configurations with padding around my bed file, which is a merge of the SeqCap EZ Exome v3.0 targets with the 1kgenomes phase 3 bed (itself a union of design files from two platforms: NimbleGen EZ Exome v1 and Agilent SureSelect v2). So for HaplotypeCaller I should use an intersection of these two files, not a merge.

    If you compare the coverage across the genome for samples from the two capture kits, do the peaks that lie within the target intervals have good coverage depth? One tool that can help answer this and other relevant questions is CollectTargetedPcrMetrics. If there is a large differential, you may want to look into padding your sample(s) with samples that have more similar profiles, towards better-powered joint calling.

    I hope this is helpful.
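
    For reference, the interval padding discussed above (e.g. 20bp or 50bp around the capture targets) conceptually expands each target by N bases and merges any intervals that then overlap. A rough Python sketch of that idea, with made-up coordinates (the real tooling additionally respects contig boundaries and more):

```python
def pad_and_merge(intervals, pad):
    """Expand each (start, end) by `pad` bases on both sides, then merge overlaps."""
    expanded = sorted((max(0, s - pad), e + pad) for s, e in intervals)
    merged = []
    for s, e in expanded:
        if merged and s <= merged[-1][1]:          # overlaps or abuts the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

targets = [(1000, 1100), (1120, 1200)]             # hypothetical capture targets
print(pad_and_merge(targets, 20))                  # → [(980, 1220)]
```

    Note how two nearby targets fuse into one padded interval; this is why modest padding can recover real variants sitting just off the target edges.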

  • @shlee said:
    @Rosmaninho,

    I've not tried bedops. Do you like it? I mostly use bedtools for interval manipulations. I would not use a merge but rather only those regions in common between the two target lists for concordance calculations. Otherwise, your numbers may be artificially depressed. Consider the implications of comparing fragments in one kit that were not targeted by the other kit: variants in one would appear as false negatives in the other. I suppose how you go about comparing two datasets depends on your ultimate aims. Are you trying to figure out whether the sensitivity of tool-chains is comparable, or are you trying to assess the difference in coverage between two capture kits?

    So if I understand correctly, for QC purposes I should run HaplotypeCaller against a bed file made with bedtools from the intersection of the capture targets of my samples with the capture targets of the 1kgenomes samples.

    What if I focus the HaplotypeCaller exclusively on the bed file with the capture target regions of my samples?

  • How do you recommend that I intersect the bed files if I should focus solely on the intersections?

    Should I do a simple intersect or add some specific output?
    like bedtools intersect -a [file] -b [file] -wa

  • shlee · Cambridge · Member, Administrator, Broadie, Moderator

    That's completely up to you @Rosmaninho. Here is the bedtools intersect documentation for your reference: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html.
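
    To make the choice concrete: bedtools intersect by default reports the overlapping portion of each A feature (clipped coordinates), while -wa reports the original A feature whenever any overlap exists. A small Python illustration of the two behaviors on made-up intervals:

```python
def intersect_clipped(a_feats, b_feats):
    """Default-intersect-like behavior: report only the overlapping portion."""
    return [(max(s1, s2), min(e1, e2))
            for s1, e1 in a_feats for s2, e2 in b_feats
            if max(s1, s2) < min(e1, e2)]

def intersect_wa(a_feats, b_feats):
    """-wa-like behavior: report the whole A feature if it overlaps anything in B."""
    return [(s1, e1)
            for s1, e1 in a_feats
            if any(max(s1, s2) < min(e1, e2) for s2, e2 in b_feats)]

a = [(100, 200)]
b = [(150, 300)]
print(intersect_clipped(a, b))   # → [(150, 200)]
print(intersect_wa(a, b))        # → [(100, 200)]
```

    For restricting calling to regions targeted by both kits, the default clipped output is typically the safer choice, since -wa keeps bases of an A feature that only one kit covers.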

  • Rosmaninho · Member

    Thank you shlee; an intersection of the target lists of the hybridization kits used in my samples was indeed the best option.

    I used bedtools to obtain this list. It worked better than bedops for intersections.

  • shlee · Cambridge · Member, Administrator, Broadie, Moderator

    Great to hear.
