The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Did you remember to?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block.
Powered by Vanilla. Made with Bootstrap.
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Downsampling

SophiaSophia Member Posts: 50
edited July 2012 in Ask the GATK team

Dear all,

I would like to ask you about some inconsistencies (maybe!) that I am observing regarding the downsampling of reads in GATK.
This is the first time I am dealing with real high coverage (several 1000/sample) and it starts to matter for me to know how to adjust these parameters.
All of the below is from GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar.

  1. I was anticipating the need for some tuning, so I read the CommandLine GATK documentation, where three options related to this issue are listed:
    --downsample_to_coverage
    --downsample_to_fraction
    --downsampling_type
    All of them have NA as default value, so I assumed, that all my reads would count for the variant calling.

  2. In the documentation for the UnifiedGenotyper, there are no options related to downsampling.
    My command:
    java -jar GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R ref.fa -I 1.bam -I 2.bam -o GATK.vcf -glm BOTH -stand_call_conf 1 -stand_emit_conf 1

  3. Neither are in the VariantAnnotator which I used subsequently on the UG output.
    My command:
    java -jar GenomeAnalysisTKLite.jar -T VariantAnnotator -R ref.fa -I 1.bam -I 2.bam -o GATK_anno.vcf --variant GATK.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

Nevertheless, my variant-called and annotated vcf file features the following:

  1. In the header, I see the full set of options used for the UG and the annotations:

UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA
RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no
nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null
[etc]

VariantAnnotator="analysis_type=VariantAnnotator input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA
RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no
nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null
[etc]

  1. My INFO and FORMAT fields for one position look like this:

ABHom=0.997;AC=4;AF=1.00;AN=4;BaseCounts=1990,4,5,0;BaseQRankSum=-0.059;DP=2000;DS;Dels=0.00;FS=0.000;HaplotypeScore=6.1837;MLEAC=4;MLEAF=1.00;MQ=35
.00;MQ0=0;MQRankSum=1.551;OND=5.000e-03;QD=33.99;ReadPosRankSum=1.133;SB=-9.057e-03;Samples=F323,F325

GT:AD:DP:GQ:PL

1/1:1,998:250:99:8592,700,0

1/1:4,992:250:99:8405,719,0

Due to the different thresholds applied by UG and VA, I have now these inconsistenf values for per-sample AD and DP.

Is this supposed to work like this, or am I using a "non-cannonical" sequence of operations by following up the UG with the VA?
Is there a way to switch OFF downsampling?

Many thanks as always for your comments!

cheers,

Sophia

Tagged:

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,388 admin
    Accepted Answer

    Hi Sophia,

    Downsampling is performed by the GATK engine, so the command-line argument that controls it lives in the
    CommandLineGATK

    The reason you're seeing different behaviors is that some walkers like the UG override the engine's default downsampling setting; for example the UG sets downsampling to 250 by default (sorry that wasn't specified, we'll make sure to add a note about that in the UG docs in the future). But you can manually override this by using the -dcov CommandLineGATK argument (as documented in the link above).

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,388 admin
    Accepted Answer

    Hi Sophia,

    Downsampling is performed by the GATK engine, so the command-line argument that controls it lives in the
    CommandLineGATK

    The reason you're seeing different behaviors is that some walkers like the UG override the engine's default downsampling setting; for example the UG sets downsampling to 250 by default (sorry that wasn't specified, we'll make sure to add a note about that in the UG docs in the future). But you can manually override this by using the -dcov CommandLineGATK argument (as documented in the link above).

    Geraldine Van der Auwera, PhD

  • Mark_DePristoMark_DePristo Broad InstituteMember Posts: 153 admin

    Turning off downsampling exposes you to major performance problems in regions of excess coverage. We don't recommend you play with this parameter

    --
    Mark A. DePristo, Ph.D.
    Co-Director, Medical and Population Genetics
    Broad Institute of MIT and Harvard

  • SophiaSophia Member Posts: 50
    edited August 2012

    I tried the following two settings now for both the UG and the VA:

    --downsampling_type BY_SAMPLE --downsample_to_coverage 1000

    This produced the expected outcome, namely a global DP of 1000xn(samples) in the INFO field, 1000 in the per-sample DP field and AD per sample numbers that added up to 1000.

    --downsampling_type ALL_READS --downsample_to_coverage 10000

    Here, I was expecting GATK to downsampling all reads from all samples to a total of 10.000 per position. Instead, only positions where any single sample reached a depth of 10.000 were printed, e.g.:
    AC=2;AF=1.00;AN=2;BaseQRankSum=6.858;DP=170360;DS;Dels=0.00;FS=8.789;HaplotypeScore=6809.3095;MLEAC=2;MLEAF=1.00;MQ=35.00;MQ0=0;MQRankSum=1.569;QD=0
    .19;ReadPosRankSum=1.513;SB=-3.277e+04 GT:AD:DP:GQ:PL ** ./. ** 1/1:63,170165:170355:99:32767,32767,0 ./.

    @Mark_DePristo: I am aware of the problems excess coverage can cause; in this case, though, I am dealing with a very short reference sequence, and am interested in mosaik-like variant configurations.

Sign In or Register to comment.