Merging batched call sets

delangeldelangel Posts: 71GATK Developer mod
edited May 2013 in Methods and Workflows

Introduction

Three-stage procedure:

  • Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.

  • Take the master sites VCF and genotype each sample BAM file at these sites

  • (Optionally) Merge the single sample VCFs into a master VCF file

Creating the master set of sites: SNPs and Indels

The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:

Batch 1:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       PASS    .       GT:GQ   0/1:30
20      10001436        .       A       AGG     .       PASS    .       GT:GQ   1/1:30

Batch 2:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30

In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:

Master VCF:

20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30  [pass in both]
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30  [only in batch 1]
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [fail in both]
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [pass in 1, fail in 2, choice in unclear]
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30  [only in batch 2]
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30  [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]

These issues fall into the following categories:

  • For sites present in all VCFs (20:9999996 above), the alleles agree, and each site PASS is pass, this site can obviously be considered "PASS" in the master VCF
  • Some sites may be PASS in one batch, but absent in others (20:10000000 and 20:10000598), which occurs when the site is polymorphic in one batch but all samples are reference or no-called in the other batch
  • Similarly, sites that are fail in all batches in which they occur can be safely filtered out, or included as failing filters in the master VCF (20:10000117)

There are two difficult situations that must be addressed by the needs of the project merging batches:

  • Some sites may be PASS in some batches but FAIL in others. This might indicate that either:
  • The site is actually truly polymorphic, but due to limited coverage, poor sequencing, or other issues it is flag as unreliable in some batches. In these cases, it makes sense to include the site
  • The site is actually a common machine artifact, but just happened to escape standard filtering in a few batches. In these cases, you would obviously like to filter out the site
  • Even more complicated, it is possible that in the PASS batches you have found a reliable allele (C/T, for example) while in others there's no alt allele but actually a low-frequency error, which is flagged as failing. Ideally, here you could filter out the failing allele from the FAIL batches, and keep the pass ones
  • Some sites may have multiple segregating alleles in each batch. Such sites are often errors, but in some cases may be actual multi-allelic sites, in particular for indels.

Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.

The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:

java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf

producing the following VCF:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      9999996     .       A       ACT         .       PASS    set=Intersection
20      10000000        .       T       G           .   PASS    set=one
20      10000117        .       C       T           .       FAIL    set=FilteredInAll
20      10000211        .       C       T           .       PASS    set=filterIntwo-one
20      10000598        .       T       A           .       PASS    set=two
20      10001436        .       A       AGG,AGGCT       .       PASS    set=Intersection

Genotyping your samples at these sites

Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.

As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:

java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none \

The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.

The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ACT         4576.19 .       .   GT:DP:GQ:PL     1/1:76:99:4576,229,0
20      10000000        .       T       G           0       .       .       GT:DP:GQ:PL     0/0:79:99:0,238,3093
20      10000211        .       C       T       857.79  .       .   GT:AD:DP:GQ:PL  0/1:28,27:55:99:888,0,870
20      10000598        .       T       A           1800.57 .       .   GT:AD:DP:GQ:PL  1/1:0,48:48:99:1834,144,0
20      10001436        .       A       AGG,AGGCT       1921.12 .       .   GT:DP:GQ:PL     0/2:49:84.06:1960,2065,0,2695,222,84

Several things should be noted here:

  • The genotype likelihoods calculation evolves, especially for indels, the exact results of this command will change.
  • The command will emit sites that are hom-ref in the sample at the site, but the -stand_call_conf 0.0 argument should be provided so that they aren't tagged as "LowQual" by the UnifiedGenotyper.
  • The filtered site 10000117 in the master.vcf is not genotyped by the UG, as it doesn't pass filters and so is considered bad by the GATK UG. If you want to determine the genotypes for all sites, independent on filtering, you must unfilter all of your records in master.vcf, and if desired, restore the filter string for these records later.

This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:

foreach sample in samples:
  run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
end

(Optional) Merging the sample VCFs together

You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:

java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf

General notes

  • Because the GATK uses dynamic downsampling of reads, it is possible for truly marginal calls to change likelihoods from discovery (processing the BAM incrementally) vs. genotyping (jumping into the BAM). Consequently, do not be surprised to see minor differences in the genotypes for samples from discovery and genotyping.
  • More advanced users may want to consider group several samples together for genotyping. For example, 100 samples could be genotyped in 10 groups of 10 samples, resulting in only 10 VCF files. Merging the 10 VCF files may be faster (or just easier to manage) than 1000 individual VCFs.
  • Sometimes, using this method, a monomorphic site within a batch will be identified as polymorphic in one or more samples within that same batch. This is because the UnifiedGenotyper applies a frequency prior to determine whether a site is likely to be monomorphic. If the site is monomorphic, it is either not output, or if EMIT_ALL_SITES is thrown, reference genotypes are output. If the site is determined to be polymorphic, genotypes are assigned greedily (as of GATK-v1.4). Calling single-sample reduces the effect of the prior, so sites which were considered monomorphic within a batch could be considered polymorphic within a sub-batch.
Post edited by Geraldine_VdAuwera on

Comments

  • ledwardsledwards LondonPosts: 14Member

    Do the original vcf files that one wants to merge need to be generated from GATK i.e. not from say DiBayes. If not - do they need to be run using Unifed Genotyper? Will it confuse/invalidate results if one mixes and matches input vcf caller with regenotyping caller?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,635Administrator, GATK Developer admin

    The basic requirement is that the VCFs be valid according to the specification, i.e. properly formatted and containing the required fields. So in principle you can mix and match. However, keep in mind that different callers may use different annotations or different scalings, so it may be difficult to do a joint analysis on data called by different programs. Analyses like variant recalibration, for example, should never be run on data that originate from separate calling runs. But you can run tools like VariantEval on mix & matched data.

    Geraldine Van der Auwera, PhD

  • sletortsletort francePosts: 22Member

    Maybe I did not understand everything, I follow the commands, but at the end, some sites in master.vcf are not in outfile. As no data for those sites seem available in my bam file, don't they should be output with something like . . ./. ?

    $ grep ^chr18 master.vcf | grep -w 117241
    chr18   117241  .       C       T       11.12   LowQual;PFMFilter       set=FilteredInAll
    $ grep ^chr18 out.vcf | grep -w 11724
    # nothing
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,635Administrator, GATK Developer admin

    Which commands did you run? There are several different possibilities mentioned in the document, so I need to see the exact command lines in order to help you.

    Geraldine Van der Auwera, PhD

  • sletortsletort francePosts: 22Member

    Once my master file is created, I simply ran the UnifiedGenotyper command with -alleles $master -L $master and all the options mentionned plus -rf BadCigar.

    ##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.2-2-gec30cee,Date="Wed Sep 10 09:52:44 CEST 2014",Epoch=1410335564572,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[set1/G469_CP_B00G6T0_C3FFHACXX_HG19_MERGE_PE_7.reliable.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[BadCigar] intervals=[true_master.vcf] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/PUBLIC_DATA/GENERATED/GenomeRef/chr_chrom_1kg_human_h37.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=BOTH pcr_error_rate=1.0E-4 computeSLOD=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=0.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=GENOTYPE_GIVEN_ALLELES alleles=(RodBinding name=alleles source=true_master.vcf) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null output_mode=EMIT_ALL_SITES allSitePLs=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub onlyEmitSamples=[] debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
    
  • sletortsletort francePosts: 22Member

    to disambiguate discordance with filenames

    $ grep ^chr18 true_master.vcf | grep -w 117241
    chr18   117241  .       C       T       11.12   LowQual;PFMFilter       set=FilteredInAll
    $ grep ^chr18 new_B00G6T0.vcf | grep -w 117241
    # nothing
    

    chr_chrom_1kg_human_h37.fasta ref is your fasta ref where chromosome name have been prefixed with 'chr'.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,635Administrator, GATK Developer admin

    The default behavior may be to ignore sites marked as filtered in the master filter. Check the options for one to include filtered sites.

    Geraldine Van der Auwera, PhD

  • sletortsletort francePosts: 22Member

    You are right, if I change the filter column (for '.') and all the missing positions came back. But I didn't find options to include those site without modifying the master. I tried to set stand_emit_conf to 0, => no change. I tried to set min_base_quality_score to 0 => only the PL field change.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,635Administrator, GATK Developer admin

    Ah. my bad, I was thinking of a different tool. Unfortunately there is no option for UG to also include filtered sites, so you can only use this for passing or unfiltered variants.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.