We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Calling variants on whole exome and whole genome samples together?

jaredmejaredme Member
edited January 2013 in Ask the GATK team


I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned on a single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?

Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?


Post edited by Geraldine_VdAuwera on

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Jared,

    That's a rather different approach than what we have experience with. Classically we would call the whole genomes and the exomes separately, then compare callsets with the variant evaluation tools.

    As we've never tried the approach you suggest, I can't really comment one way or the other, except to say there is no major obstacle to doing it that I can think of. If you try this, please do let us know how it turns out, so we can share the merits or drawbacks with the user community. Thanks!

  • jaredmejaredme Member

    Thanks for your answer. I ended up separating the exome and whole genome samples before performing multi-sample calling so I wouldn't risk confusing VQSR afterwards. I also had another project where I was working with exome samples from 2 different versions of a capture kit. I separated those as well for the same reason.

  • SamirSamir Member ✭✭

    Hi everyone,

    I like to revisit this question and see if there is any change in suggestion over last five years. In my case, I am working on canine genome WGS and WES tumor/normal pairs. I have them aligned using GATK3 best practices. Given we have 30X WGS and 80X WES, I like to see if I can merge analysis-ready bam files from two platforms and then call variant calls. That may allow us to increase confidence for variant calls in captured regions with better read support. I can limit variant calling only for exome captured regions in order to minimize violating assumptions regarding local realignment by MuTect2 (GATK3.8) during somatic variant calling.

    I wonder if it is still better to separately call variants and then merge variants based on variant QC filters. I already have calls from separate files but yet to compare with calls from merged bam sets. I am hoping to use merged bam approach to rescue variant calls in regions where a base is poorly covered (2-4 reads) in either of WGS and WES bam file.

    Also, would use of MuTect v1 or v2 would make any difference in calling variant calls on merged tumor/normal pair? My guess is variant calls from mutect v1 may have better confidence than mutect v2 given the latter relies on local realignment. I can update more on comparison of calls but meantime, good to get your comments and red flag(s) if any!


    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Samir,

    I am consulting with the developer on your questions.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Samir,

    For Mutect2/MuTect2 (M2), it seems we have different recommendations than that above. This is what our developer says:

    M2 filtering for WES and WGS is similar enough that combining the reads in the single run of M2 is probably worthwhile. In GATK4 this requires the sample name in both bam headers to be the same. It will certainly help sensitivity.

    Depending on how important precision is, you might want to run the WGS alone with a low tumor-lod-to-emit and tumor-lod and then filter out calls that don't have at least some support in the WGS alone. Of course, with the lowish coverage in the WGS this might be too conservative. It depends on the desired trade-off between sensitivity and precision.

    But you don't have to worry about this extra step. I don't think that the precision of the combined M2 run would be worse than a 110x exome.

    Finally, I don't see any reason a priori to use M1 over M2 for this, but I have never actually done it before.

    I hope this is helpful to you.

  • SamirSamir Member ✭✭

    hi @shlee,

    That was very helpful and so as this document in running M2. Thanks!

    I have now merged WGS and exome bam files for tumor and normal. While merging, I kept RGSM in identical for each of merged tumor and normal, e.g., merged_sample1_tumor, merged_sample1_normal. I used samtools reheader over Picard AddorReplaceReadGroups as I believe that Picard has no option to change only RGSM, leaving RGID, RGLB and other tags intact. Then, I merged wgs+exome for tumor and normal separately using Picard MergeSamFiles to get analysis-ready merged tumor and normal bam files. Then, I ran GATK4 M2 in both, tumor-normal and panel of normal mode (using normal sample) to get somatic snv/indels and sample-wise PONs, respectively. I will then run CreateSomaticPanelOfNormals to get final PON which I can use to recall tumor/normal and tumor-only somatic variants by adding --normal_panel PON.file option.

    Once I have raw somatic variants, I will run FilterMutectCalls. This is where things may get tricky as I do not have population level resource for canine genome (besides dbsnp for canfam 3.1 genome ?). For now, I plan to generate contamination.table using following commands:

     gatk GetPileupSummaries \
       -I tumor.bam \
       -V common_biallelic.vcf.gz \
       -O pileups.table

    I have to go through GATK docs on how to get biallelic calls from raw calls and see if there is any INFO/FORMAT filter for that.

    and then,

    gatk CalculateContamination \
    -I 7_tumor_getpileupsummaries.table \
    -O 8_tumor_calculatecontamination.table

    and, then use...

    gatk FilterMutectCalls \
    -V somatic_m2.vcf.gz \
    --contamination-table tumor_calculatecontamination.table \
    -O 9_somatic_oncefiltered.vcf.gz

    Does contamination table is a merged across all study cohort or tumor-normal specific ?

    Please comment if workflow needs improvement? I will keep adding a few other filters downstream but for now, this is a minimal set of commands I am running.


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Samir,

    Perhaps you can contact @sutturka (see this discussion) for access to the Broad's 435 dog germline callset VCF. We will get back to you on your other questions.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Samir you can get biallelic calls with SelectVariants:

    java -jar $gatk SelectVariants -V variants.vcf -restrict-alleles-to BIALLELIC -O biallelic.vcf

    If you want biallelic SNPs then you would add -select-type SNP.

    Contamination tables are specific to each tumor sample. Also, in your code above you run CalculateContamination in tumor-only mode, which works, but since you have normals you may as well use matched normal mode. To do so, run GetPileupSummaries on the normals and pass them to CalculateContamination with --matched-normal normal_getpileupsummaries.table.

  • SamirSamir Member ✭✭

    Great and many thanks! This pretty much resolves my current issue.

Sign In or Register to comment.