Downsampling

SophiaSophia Posts: 35Member
edited July 2012 in Ask the GATK team

Dear all,

I would like to ask you about some inconsistencies (maybe!) that I am observing regarding the downsampling of reads in GATK. This is the first time I am dealing with real high coverage (several 1000/sample) and it starts to matter for me to know how to adjust these parameters. All of the below is from GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar.

  1. I was anticipating the need for some tuning, so I read the CommandLine GATK documentation, where three options related to this issue are listed: --downsample_to_coverage --downsample_to_fraction --downsampling_type All of them have NA as default value, so I assumed, that all my reads would count for the variant calling.

  2. In the documentation for the UnifiedGenotyper, there are no options related to downsampling. My command: java -jar GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R ref.fa -I 1.bam -I 2.bam -o GATK.vcf -glm BOTH -stand_call_conf 1 -stand_emit_conf 1

  3. Neither are in the VariantAnnotator which I used subsequently on the UG output. My command: java -jar GenomeAnalysisTKLite.jar -T VariantAnnotator -R ref.fa -I 1.bam -I 2.bam -o GATK_anno.vcf --variant GATK.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

Nevertheless, my variant-called and annotated vcf file features the following:

  1. In the header, I see the full set of options used for the UG and the annotations:

UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

VariantAnnotator="analysis_type=VariantAnnotator input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

  1. My INFO and FORMAT fields for one position look like this:

ABHom=0.997;AC=4;AF=1.00;AN=4;BaseCounts=1990,4,5,0;BaseQRankSum=-0.059;DP=2000;DS;Dels=0.00;FS=0.000;HaplotypeScore=6.1837;MLEAC=4;MLEAF=1.00;MQ=35 .00;MQ0=0;MQRankSum=1.551;OND=5.000e-03;QD=33.99;ReadPosRankSum=1.133;SB=-9.057e-03;Samples=F323,F325

GT:AD:DP:GQ:PL

1/1:1,998:250:99:8592,700,0

1/1:4,992:250:99:8405,719,0

Due to the different thresholds applied by UG and VA, I have now these inconsistenf values for per-sample AD and DP.

Is this supposed to work like this, or am I using a "non-cannonical" sequence of operations by following up the UG with the VA? Is there a way to switch OFF downsampling?

Many thanks as always for your comments!

cheers,

Sophia

Post edited by Sophia on
Tagged:

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin
    Answer ✓

    Hi Sophia,

    Downsampling is performed by the GATK engine, so the command-line argument that controls it lives in the CommandLineGATK

    The reason you're seeing different behaviors is that some walkers like the UG override the engine's default downsampling setting; for example the UG sets downsampling to 250 by default (sorry that wasn't specified, we'll make sure to add a note about that in the UG docs in the future). But you can manually override this by using the -dcov CommandLineGATK argument (as documented in the link above).

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin
    Answer ✓

    Hi Sophia,

    Downsampling is performed by the GATK engine, so the command-line argument that controls it lives in the CommandLineGATK

    The reason you're seeing different behaviors is that some walkers like the UG override the engine's default downsampling setting; for example the UG sets downsampling to 250 by default (sorry that wasn't specified, we'll make sure to add a note about that in the UG docs in the future). But you can manually override this by using the -dcov CommandLineGATK argument (as documented in the link above).

    Geraldine Van der Auwera, PhD

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    Turning off downsampling exposes you to major performance problems in regions of excess coverage. We don't recommend you play with this parameter

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • SophiaSophia Posts: 35Member
    edited August 2012

    I tried the following two settings now for both the UG and the VA:

    --downsampling_type BY_SAMPLE --downsample_to_coverage 1000

    This produced the expected outcome, namely a global DP of 1000xn(samples) in the INFO field, 1000 in the per-sample DP field and AD per sample numbers that added up to 1000.

    --downsampling_type ALL_READS --downsample_to_coverage 10000

    Here, I was expecting GATK to downsampling all reads from all samples to a total of 10.000 per position. Instead, only positions where any single sample reached a depth of 10.000 were printed, e.g.: AC=2;AF=1.00;AN=2;BaseQRankSum=6.858;DP=170360;DS;Dels=0.00;FS=8.789;HaplotypeScore=6809.3095;MLEAC=2;MLEAF=1.00;MQ=35.00;MQ0=0;MQRankSum=1.569;QD=0 .19;ReadPosRankSum=1.513;SB=-3.277e+04 GT:AD:DP:GQ:PL ** ./. ** 1/1:63,170165:170355:99:32767,32767,0 ./.

    @Mark_DePristo: I am aware of the problems excess coverage can cause; in this case, though, I am dealing with a very short reference sequence, and am interested in mosaik-like variant configurations.

    Post edited by Sophia on
Sign In or Register to comment.