Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

HaplotypeCaller only generates header when called with recalibrated/dedupped BAM files

nhvanlienhvanlie Posts: 9Member

Hello,

I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,984Administrator, GATK Developer admin

    Hmm, we do have a known issue related to using both -baq and -BQSR together, as you are in the last step of pre-processing: the BAQ'ing is being done first in the engine (on the original bases), but in fact it should be done after recalibration with the BQSR. Try again using only -BQSR with your PrintReads step and then use -BAQ in your variant calling step. Let me know if that solves your problem.

    FYI this baq/bqsr issue is fixed in our internal dev version, but the fix will only be available when we release version 2.5 because it's too complex to patch onto 2.4.

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Posts: 9Member

    Thanks @Geraldine_VdAuwera for the quick response!

    I tried eliminating the -baq from the Print Reads command, instead calling:

    java -Xmx4g -Djava.io.tmpdir=tmpdir/ -jar $gatk_dir/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.dedup.bam -BQSR exome.recal_data.grp -o exome.processed.bam

    I then tried a few iterations of the HaplotypeCaller invocation, first including the -baw parameter:

    java -Xmx4g -Djava.io.tmpdir=tmpdir/ -jar $gatk_dir/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.processed.bam -o ./variants.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -baq CALCULATE_AS_NECESSARY -L Illumina_TruSeq.bed

    This output the error:

    ERROR MESSAGE: Invalid command line: Argument baq has a bad value: Walker cannot accept BAQ'd base qualities, and yet BAQ mode CALCULATE_AS_NECESSARY was requested.

    I then tried calling it without the -baq parameter, testing both with and without the -L parameter:

    java -Xmx4g -Djava.io.tmpdir=tmpdir/ -jar $gatk_dir/GenomeAnalysisTK.jar -T HaplotypeCaller --R ucsc.hg19.fasta -I exome.processed.bam -o ./variants.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumina_TruSeq.bed

    java -Xmx4g -Djava.io.tmpdir=tmpdir/ -jar $gatk_dir/GenomeAnalysisTK.jar -T HaplotypeCaller --R ucsc.hg19.fasta -I exome.processed.bam -o ./variants.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar

    Both of those runs completed with the same problem (only a header output).

  • nhvanlienhvanlie Posts: 9Member

    Hi @Geraldine_VdAuwera! It looks like we fixed our problem; it was cased by our simulator, dwgsim.

    Dwgsim creates a set of random reads (rather than reads mutated from the reference) where the contig is given as "rand". These are included to help evaluate the aligner and they are supposed to be ignored in the remaining steps but it seems one the pre-processing steps makes them look like viable reads to HaplotypeCaller (and coincidentally Dindel). We fixed this problem by removing all the "rand" labeled reads in the sam file and now HaplotypeCaller is working properly.

    Thank you for your help- it was running Picard ValidateSamFile validation as a quality check that helped us identify this issue.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,984Administrator, GATK Developer admin

    Glad to hear your problem is solved! And thanks for reporting your solution, as it may prove useful to other users in future. Validating files is always a good troubleshooting reflex :-)

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.