Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

ReduceReads Tool

dvuzmandvuzman Harvard new research building, BostonMember

I applied the ReduceReads tool on a tested bam file. Comparison of the reduced and unreduced vcf files

(before and after VQSR/VEP) reveals that *.reduced.vcf is ~6 times smaller than the *.regular.vcf.

And also, the GATK QUAL, FILTER, etc scores are different for SNPs that present in both files.

I would very much appreciate your input on what are the causes to the differences and possible solutions.

Here is an example for the different GATK scores for the same SNP:

chr1 16495 . G C 227.77 SNP_QD;VQSRTrancheSNP99.90to100.00 AC=1;AF=0.500;AN=2;BaseQRankSum=-2.281;ClippingRankSum=2.688;DP=134;FS=1.486;MLEAC=1;MLEAF=0.500;MQ=50.63;MQ0=0;MQRankSum=-0.911;NEGATIVE_TRAIN_SITE;QD=1.70;ReadPosRankSum=1.122;VQSLOD=-9.347e+00;culprit=DP;CSQ=C|ENSG00000223972|ENST00000456328|Transcript|downstream_gene_variant||||||rs141130360|||||||||2086||YES|DDX11L1|HGNC||||processed_transcript||||||||,C|ENSG00000227232|ENST00000488147|Transcript|intron_variant&nc_transcript_variant||||||rs141130360||||8/10||||||||WASH7P|HGNC||||unprocessed_pseudogene|||||||| GT:AD:GQ:PL 0/1:98,36:99:256,0,4523

chr1 16495 . G C 232.77 SNP_QD;VQSRTrancheSNP99.00to99.90 AC=1;AF=0.500;AN=2;BaseQRankSum=-2.367;ClippingRankSum=1.912;DP=136;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=50.64;MQ0=0;MQRankSum=-1.037;NEGATIVE_TRAIN_SITE;QD=1.71;ReadPosRankSum=0.743;VQSLOD=-2.309e+00;culprit=QD;CSQ=C|ENSG00000223972|ENST00000456328|Transcript|downstream_gene_variant||||||rs141130360|||||||||2086||YES|DDX11L1|HGNC||||processed_transcript||||||||,C|ENSG00000227232|ENST00000488147|Transcript|intron_variant&nc_transcript_variant||||||rs141130360||||8/10||||||||WASH7P|HGNC||||unprocessed_pseudogene|||||||| GT:AD:GQ:PL 0/1:99,37:99:261,0,4570

Here is the general downstream pipeline that I'm using:
1. Running HaplotypeCaller

2a. VQSR: recalibrate

2a. VQSR: apply

  1. Select only SNPs

  2. Select only indels

  3. Filter SNPs

  4. Filter INDELs

  5. Combine SNPs and INDELs

  6. Annotation with VEP (works with reduced reads, although it is non-GATK tool).

I wish you a wonderful new 2014 year.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Dana,

    Happy new year to you too!

    Can you please tell me what version you used and what were your command lines? Also, can you clarify whether you recalibrated SNPs and indels together or separately? Both steps (recalibrate/apply) need to be done separately per type of variant.

  • dvuzmandvuzman Harvard new research building, BostonMember

    Thank you so much for your prompt reply, Geraldine!

    I'm using GATK version 2.5.2-0, which is the latest commercially available (I used this tool as a research consultant in an industrial company). Here is the command line:

    java -Xmx8000m –jar /GenomeAnalysisTK-2013.2-2.5.2-0-g3ae1219/GenomeAnalysisTK.jar –T ReduceReads \ –R g1k_v37.fasta \
    –I LP6005181-DNA_B01.complete.chr1.bam \
    –o LP6005181-DNA_B01.complete.reduced.chr1.bam

    SNPs and indels are recalibrated separately with two separate steps (recalibrate/apply).

  • dvuzmandvuzman Harvard new research building, BostonMember

    Geraldine, I would like to correct the command that I used. Here is the correct one (the same as mentioned above, but with -L argument.

    java –jar /GenomeAnalysisTK-2013.2-2.5.2-0-g3ae1219/GenomeAnalysisTK.jar –T ReduceReads \ –R g1k_v37.fasta \
    –I LP6005181-DNA_B01.complete.bam \
    –o LP6005181-DNA_B01.complete.reduced.bam
    -L chr1.bed

    The input for both LP6005181-DNA_B01.complete.reduced.chr1.bam and LP6005181-DNA_B01.complete.chr1.bam was LP6005181-DNA_B01.complete.bam
    To get the reduced bam, the command above was used, and to get just the chr1 reads, the PrintReads with the -L chr1.bed was used.

    The resulted LP6005181-DNA_B01.complete.reduced.chr1.bam and LP6005181-DNA_B01.complete.chr1.bam were further processed by HaplotypeCaller which produced *.reduced.vcf which is ~6 times smaller than the *.regular.vcf.

    I hope that this is more clear now.

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Dana,

    Sorry for the response delay, the snowstorm disrupted my schedule. FYI if you are using GATK on behalf of a licensed customer you can get faster support from Appistry, who handles commercial support.

    Regarding your comparison of reduced v. unreduced calls, a couple of points.

    1. It is normal for the QUAL score to change slightly, since ReduceReads will have done some downsampling and filtering that can affect which reads end up being used for calling. As long as the differences you observe are marginal, it is no cause for concern.

    2. The differences in FILTER are likely due to instability in the VQSR modeling process because you are giving it insufficient data. I understand that you are processing a single chromosome of a single sample at a time for testing purposes, but be aware that VQSR will not work properly on such a small amount of data. There is effectively no good way to do small-scale testing of VQSR.

    3. The dramatic difference you're seeing in the number of calls made by HC on the two files is more alarming, although it could be an issue of data quality and calling thresholds. The question here is whether the "missing" calls look like they are real variants or just junk in the unreduced callset.

    In any case you might want to ask Appistry (or have your client ask them) to confirm which version is the latest commercially available, because I'd be surprised if they weren't further along than 2.5. There were quite a few important issues with ReduceReads in 2.5 that were fixed in subsequent versions (see release notes for details).

    Note that we are working to change the multisample calling pipeline to no longer need ReduceReads.

  • dvuzmandvuzman Harvard new research building, BostonMember

    Dear Geraldine,
    Thank you very much for your answer.
    All the best,

Sign In or Register to comment.